KR20200014723A - Model training method and data similarity determination method, apparatus and device thereof

Info

Publication number: KR20200014723A
Application number: KR1020197023923A
Authority: KR
Inventors: Nan Jiang; Hongwei Zhao
Original assignee: Alibaba Group Holding Limited
Priority date: 2017-07-19
Filing date: 2018-07-19
Publication date: 2020-02-11
Also published as: EP3611657A1; TW201909005A; EP3611657A4; US11288599B2; US20200167693A1; SG11201907257SA; PH12019501851A1; KR102349908B1; CN107609461A; WO2019015641A1; TWI735782B; JP2020524315A; JP6883661B2; US20200012969A1

Abstract

Embodiments of the present application disclose a model training method, apparatus, and device, and a data similarity determination method, apparatus, and device. The model training method includes: obtaining a plurality of user data pairs, where the data fields of the two sets of user data in each user data pair share an identical part; obtaining a user similarity corresponding to each user data pair, where the user similarity is the similarity between the users corresponding to the two sets of user data in that pair; determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and training the classification model based on the sample data to obtain a similarity classification model. With the embodiments of the present application, a model can be trained quickly, model training efficiency can be improved, and resource consumption can be reduced.

Description

Model training method and data similarity determination method, apparatus and device thereof

The present invention relates to the field of computer technology, and in particular to a model training method, apparatus, and device, and a data similarity determination method, apparatus, and device.

As a new identity verification method, face recognition has created new risks while bringing convenience to users. For multiple users with very similar appearances (such as twins), it is difficult to distinguish between the different users effectively through face recognition, and the failure to identify users correctly is very likely to lead to risks such as erroneous account registration and misappropriation of account funds. As the most typical known case of very similar appearance, twins, and identical twins in particular, are closely related to each other and are very likely to be involved in behavior associated with these risks. How to identify the user data of twins from a large amount of data has therefore become an important problem to be solved.

In general, a recognition model is built from preselected sample data using a supervised machine learning method. Specifically, investigators conduct social surveys through questionnaires, prize-winning quizzes, or manual observation, collect user data, and obtain and label associations or twin relationships between users through manual observation or by asking the people surveyed. Based on the manually labeled associations or twin relationships, an identification model is built using the corresponding user data as sample data.

However, the identification model built with the supervised machine learning method described above requires manual labeling of the sample data. Manual labeling consumes a large amount of human resources and a great deal of time, which makes model training inefficient and leads to high resource consumption.

An objective of the embodiments of the present application is to provide a model training method, apparatus, and device, and a data similarity determination method, apparatus, and device, so as to train a model quickly, improve model training efficiency, and reduce resource consumption.

To solve the above technical problem, the embodiments of the present application are implemented as follows:

An embodiment of the present application provides a model training method, including:

obtaining a plurality of user data pairs, where the data fields of the two sets of user data in each user data pair share an identical part;

obtaining a user similarity corresponding to each user data pair, where the user similarity is the similarity between the users corresponding to the two sets of user data in that user data pair;

determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and

training the classification model based on the sample data to obtain a similarity classification model.

Optionally, obtaining the user similarity corresponding to each user data pair includes:

obtaining biological features of the users corresponding to a first user data pair, where the first user data pair is any user data pair among the plurality of user data pairs; and

determining the user similarity corresponding to the first user data pair according to the biological features of the users corresponding to the first user data pair.

Optionally, the biological features include a face image feature;

obtaining the biological features of the users corresponding to the first user data pair includes:

obtaining face images of the users corresponding to the first user data pair; and

performing feature extraction on the face images to obtain face image features;

correspondingly, determining the user similarity corresponding to the first user data pair according to the biological features of the users corresponding to the first user data pair includes:

determining the user similarity corresponding to the first user data pair according to the face image features of the users corresponding to the first user data pair.

Optionally, the biological features include a voice feature;

obtaining the biological features of the users corresponding to the first user data pair includes: obtaining voice data of the users corresponding to the first user data pair; and

performing feature extraction on the voice data to obtain voice features;

correspondingly, determining the user similarity corresponding to the first user data pair includes: determining the user similarity corresponding to the first user data pair according to the voice features of the users corresponding to the first user data pair.

Optionally, determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, the sample data for training the classification model includes:

performing feature extraction on each user data pair in the plurality of user data pairs to obtain associated user features between the two sets of user data in each user data pair; and

determining the sample data for training the classification model according to the associated user features between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair.

Optionally, determining the sample data for training the classification model according to the associated user features between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair includes:

selecting positive sample features and negative sample features from the user features corresponding to the plurality of user data pairs, according to the user similarity corresponding to each user data pair and a predetermined similarity threshold; and

using the positive sample features and the negative sample features as the sample data for training the classification model.

Optionally, the user features include a household registration dimension feature, a name dimension feature, a social feature, and an interest feature, where the household registration dimension feature includes a feature of user identity information, the name dimension feature includes a feature of user name information and a feature of the degree of rarity of the user's surname, and the social feature includes a feature of the user's social relationship information.

Optionally, the positive sample features include the same number of features as the negative sample features.
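The threshold-based selection of positive and negative sample features, together with the optional balancing of their counts, might be sketched as follows. This is only an illustration under stated assumptions: the function name `select_samples`, the single threshold value of 0.8, the down-sampling strategy, and the toy feature vectors are all hypothetical and are not specified by the application.

```python
import random


def select_samples(pair_features, similarities, threshold=0.8, seed=0):
    """Label pair features by user similarity and balance the two classes.

    pair_features: one feature vector per user data pair (hypothetical layout).
    similarities:  one user similarity in [0, 1] per user data pair.
    threshold:     assumed similarity threshold; the value is illustrative.
    """
    positives = [f for f, s in zip(pair_features, similarities) if s >= threshold]
    negatives = [f for f, s in zip(pair_features, similarities) if s < threshold]

    # Down-sample the larger class so both classes have the same size
    # (one possible way to satisfy the "same number" option).
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    positives = rng.sample(positives, n)
    negatives = rng.sample(negatives, n)

    # Label 1 marks a positive sample, label 0 a negative sample.
    return [(f, 1) for f in positives] + [(f, 0) for f in negatives]


# Toy data: five pairs, each with a made-up two-dimensional feature vector.
features = [[1.0, 0.9], [0.9, 0.8], [0.2, 0.1], [0.1, 0.3], [0.0, 0.2]]
sims = [0.95, 0.90, 0.30, 0.20, 0.10]
samples = select_samples(features, sims)
```

No manual labeling is involved here: the label of each sample follows entirely from the user similarity of its pair.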

Optionally, the similarity classification model is a binary classifier model.
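As a rough illustration of what training a binary classifier on such sample data could look like, the sketch below fits a small logistic regression with plain stochastic gradient descent. Logistic regression is chosen here only as a stand-in; this passage does not name a particular algorithm, and the function names, hyperparameters, and toy samples are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary_classifier(samples, epochs=500, lr=0.5):
    """Fit weights and bias on (feature_vector, label) samples, label in {0, 1}."""
    dim = len(samples[0][0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_similarity(model, x):
    """Return the model's score in (0, 1) for one pair feature vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical, linearly separable toy samples: (pair features, label).
toy_samples = [([1.0, 1.0], 1), ([0.9, 0.8], 1), ([0.1, 0.0], 0), ([0.0, 0.2], 0)]
model = train_binary_classifier(toy_samples)
high = predict_similarity(model, [1.0, 0.9])
low = predict_similarity(model, [0.0, 0.1])
```

The score returned by `predict_similarity` can then be read as a similarity between the two users of a pair.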

An embodiment of the present application further provides a data similarity determination method, including:

obtaining a to-be-detected user data pair;

performing feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and

determining, according to the to-be-detected user features and a pre-trained similarity classification model, the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair.

Optionally, the method further includes:

determining that the to-be-detected users corresponding to the to-be-detected user data pair are twins if the similarity between the users corresponding to the two sets of to-be-detected user data in the pair is greater than a predetermined similarity classification threshold.
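The detection flow (extract features from the pair, score them with the trained model, then compare the score against the classification threshold) could be sketched as below. The feature extractor, the scoring function, the field names, and the threshold of 0.8 are placeholders invented for this illustration; in practice the score would come from the trained similarity classification model.

```python
def classify_pair(pair, extract_features, score, threshold=0.8):
    """Return (similarity, is_twin) for one to-be-detected user data pair.

    extract_features and score are caller-supplied callables standing in for
    the feature extraction step and the pre-trained similarity classification
    model; the threshold value is an assumption.
    """
    features = extract_features(pair)
    similarity = score(features)
    # Strictly "greater than", following the wording of the method.
    return similarity, similarity > threshold


def toy_features(pair):
    """Hypothetical features: same surname, same birth date."""
    a, b = pair
    return [
        1.0 if a["surname"] == b["surname"] else 0.0,
        1.0 if a["birth_date"] == b["birth_date"] else 0.0,
    ]


def toy_score(features):
    """Placeholder for a trained model: the mean of the feature values."""
    return sum(features) / len(features)


pair_1 = ({"surname": "Zhang", "birth_date": "1990-01-01"},
          {"surname": "Zhang", "birth_date": "1990-01-01"})
pair_2 = ({"surname": "Zhang", "birth_date": "1990-01-01"},
          {"surname": "Zhang", "birth_date": "1985-12-12"})

result_1 = classify_pair(pair_1, toy_features, toy_score)
result_2 = classify_pair(pair_2, toy_features, toy_score)
```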

An embodiment of the present application provides a model training apparatus, including:

a data acquisition module configured to obtain a plurality of user data pairs, where the data fields of the two sets of user data in each user data pair share an identical part;

a similarity acquisition module configured to obtain a user similarity corresponding to each user data pair, where the user similarity is the similarity between the users corresponding to the two sets of user data in that user data pair;

a sample data determination module configured to determine, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and

a model training module configured to train the classification model based on the sample data to obtain a similarity classification model.

Optionally, the similarity acquisition module includes:

a biological feature acquisition unit configured to obtain biological features of the users corresponding to a first user data pair, where the first user data pair is any user data pair among the plurality of user data pairs; and

a similarity acquisition unit configured to determine the user similarity corresponding to the first user data pair according to the biological features of the users corresponding to the first user data pair.

The biological feature acquisition unit is configured to obtain face images of the users corresponding to the first user data pair, and to perform feature extraction on the face images to obtain face image features;

correspondingly, the similarity acquisition unit is configured to determine the user similarity corresponding to the first user data pair according to the face image features of the users corresponding to the first user data pair.

The biological feature acquisition unit is configured to obtain voice data of the users corresponding to the first user data pair, and to perform feature extraction on the voice data to obtain voice features;

correspondingly, the similarity acquisition unit is configured to determine the user similarity corresponding to the first user data pair according to the voice features of the users corresponding to the first user data pair.

Optionally, the sample data determination module includes:

a feature extraction unit configured to perform feature extraction on each user data pair in the plurality of user data pairs to obtain associated user features between the two sets of user data in each user data pair; and

a sample data determination unit configured to determine the sample data for training the classification model according to the associated user features between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair.

Optionally, the sample data determination unit is configured to select positive sample features and negative sample features from the user features corresponding to the plurality of user data pairs, according to the user similarity corresponding to each user data pair and a predetermined similarity threshold, and to use the positive sample features and the negative sample features as the sample data for training the classification model.

Optionally, the user features include a household registration dimension feature, a name dimension feature, a social feature, and an interest feature, where the household registration dimension feature includes a feature of user identity information, the name dimension feature includes a feature of user name information and a feature of the degree of rarity of the user's surname, and the social feature includes a feature of the user's social relationship information.

Optionally, the similarity classification model is a binary classifier model.

An embodiment of the present application further provides a data similarity determination apparatus, including:

a to-be-detected data acquisition module configured to obtain a to-be-detected user data pair;

a feature extraction module configured to perform feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and

a similarity determination module configured to determine, according to the to-be-detected user features and a pre-trained similarity classification model, the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair.

Optionally, the apparatus further includes:

a similarity classification module configured to determine that the to-be-detected users corresponding to the to-be-detected user data pair are twins if the similarity between the users corresponding to the two sets of to-be-detected user data in the pair is greater than a predetermined similarity classification threshold.

An embodiment of the present application provides a model training device, including:

a processor; and

a memory configured to store computer-executable instructions that, when executed, cause the processor to perform the following operations:

obtaining a plurality of user data pairs, where the data fields of the two sets of user data in each user data pair share an identical part;

obtaining a user similarity corresponding to each user data pair, where the user similarity is the similarity between the users corresponding to the two sets of user data in that user data pair;

determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and

training the classification model based on the sample data to obtain a similarity classification model.

An embodiment of the present application provides a data similarity determination device, including:

a processor; and

a memory configured to store computer-executable instructions that, when executed, cause the processor to perform the following operations:

obtaining a to-be-detected user data pair;

performing feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and

determining, according to the to-be-detected user features and a pre-trained similarity classification model, the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair.

As can be seen from the technical solutions provided by the embodiments of the present application, in these embodiments a plurality of user data pairs is obtained, where the data fields of the two sets of user data in each user data pair share an identical part; a user similarity corresponding to each user data pair is obtained; sample data for training a preset classification model is determined; and the classification model is then trained based on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of to-be-detected user data in a to-be-detected user data pair can subsequently be determined according to the similarity classification model. In this way, the plurality of user data pairs is obtained by means of identical data fields, and the association between the users corresponding to the two sets of user data in each pair is determined from the user similarity, which yields the sample data for training the preset classification model. That is, the sample data can be obtained without manual labeling, so that the model can be trained quickly, model training efficiency can be improved, and resource consumption can be reduced.

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 illustrates an embodiment of a model training method according to the present application.
FIG. 2 illustrates an embodiment of a data similarity determination method according to the present application.
FIG. 3 is a schematic diagram of an interface of a detection application according to the present application.
FIG. 4 illustrates an embodiment of a data similarity determination method according to the present application.
FIG. 5 is a schematic diagram of the processing logic of a data similarity determination process according to the present application.
FIG. 6 illustrates an embodiment of a model training apparatus according to the present invention.
FIG. 7 illustrates an embodiment of a data similarity determination apparatus according to the present invention.
FIG. 8 illustrates an embodiment of a model training device according to the present application.
FIG. 9 illustrates an embodiment of a data similarity determination device according to the present application.

Embodiments of the present application provide a model training method, apparatus, and device, and a data similarity determination method, apparatus, and device.

To enable a person of ordinary skill in the art to better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the protection scope of the present application.

Embodiment 1

As shown in FIG. 1, an embodiment of the present application provides a model training method. The method may be performed by a terminal device or a server. The terminal device may be a personal computer or the like. The server may be a single independent server or a server cluster formed by a plurality of servers. To improve model training efficiency, this embodiment is described in detail using the example in which the method is executed by a server. The method may specifically include the following steps:

In step S102, a plurality of user data pairs is obtained, where the data fields of the two sets of user data in each user data pair share an identical part.

Each user data pair may include the user data of a plurality of different users. For example, the plurality of user data pairs includes user data pair A and user data pair B, where user data pair A includes user data 1 and user data 2, and user data pair B includes user data 3 and user data 4. User data may be data related to a user, for example identity information such as the user's name, age, height, address, identity card number, and social security card number, and may also include information such as the user's interests, purchased goods, and travel. A data field may be a field or character that can indicate the identities of the users corresponding to the two different sets of user data in a user data pair, as well as the association between those users, for example a surname, a preset number of digits of an identity card number (for example, the first 14 digits), a social security card number, or another identification number from which a user's identity or information can be determined.

In implementations, user data may be obtained in a variety of ways. For example, user data may be purchased from different users, or it may be information entered by a user when registering with a website or application (for example, the information a user enters when registering with Alipay®) or user data actively uploaded by a user. The specific manner in which user data is obtained is not limited in the embodiments of the present application. After the user data is obtained, the data fields it contains may be compared to find user data whose data fields share an identical part. User data whose data fields share an identical part may be grouped together to form a user data pair. In this way, a plurality of user data pairs can be obtained, and the data fields of the user data in each user data pair share an identical part.

For example, in practical applications, to reduce the amount of computation and improve processing efficiency as much as possible, the data fields may be set to the identity card number and the surname, and information such as identity card numbers and user names may be looked up in the user data, considering that one or more digits of an identity card number, for example its first 14 digits, can indicate a relationship between two users. In this embodiment, the first 14 digits of the identity card number are used, as an example, as the basis for determining whether data fields share an identical part. Specifically, the first 14 digits of the identity card number and the surname of each user may be obtained, and the first 14 digits of the identity card numbers and the surnames of different users may be compared. Two sets of user data with the same surname and the same first 14 digits of the identity card number may be grouped into one user data pair. Specifically, a user data pair may be stored in the form of a user pair, for example {identity card number of user 1, identity card number of user 2, name of user 1, name of user 2, other data of user 1, other data of user 2}.
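The grouping described above can be sketched as follows: records are bucketed by (surname, first 14 digits of the identity card number), and every two records in the same bucket form one candidate user data pair. This is a minimal sketch rather than the application's implementation; the record layout, the field names `surname` and `id_number`, and the sample identity card numbers are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_user_data_pairs(records, prefix_len=14):
    """Pair up records that share a surname and an identity card number prefix."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["surname"], rec["id_number"][:prefix_len])
        buckets[key].append(rec)

    pairs = []
    for group in buckets.values():
        # Every two records in the same bucket form one candidate pair.
        pairs.extend(combinations(group, 2))
    return pairs


# Made-up 18-digit identity card numbers, used only to exercise the function.
records = [
    {"user": "u1", "surname": "Zhang", "id_number": "110101199001011234"},
    {"user": "u2", "surname": "Zhang", "id_number": "110101199001015678"},
    {"user": "u3", "surname": "Li", "id_number": "110101199001019012"},
    {"user": "u4", "surname": "Zhang", "id_number": "310101198512123456"},
]
pairs = build_user_data_pairs(records)
```

Bucketing first keeps the comparison cost low: only records that already share the key are paired, instead of comparing every record with every other record.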

It should be noted that the data fields of two sets of user data having the same part can be understood to mean either that part of the content of the data field is the same, for example the first 14 digits of an 18-digit identity card number, or that the entire content of the data field is the same.

In step S104, the user similarity corresponding to each user data pair is obtained, where the user similarity is the similarity between the users corresponding to the two sets of user data in each user data pair.

User similarity can be used to indicate the degree of similarity between a plurality of users, for example 99% or 50%. In practical applications, user similarity can also be expressed in other ways, for example as twins versus non-twins, or as identical twins versus fraternal twins.

In implementations, the main purpose of this embodiment is to train a classification model, which requires both sample data for training and the user similarity corresponding to that sample data. The user similarity may be stored in advance on a server or terminal device, and it may be determined in various ways. One optional processing method is as follows. Images of the users may be obtained in advance; an image may be uploaded by a user when registering with an application or website, and the users are those corresponding to the two sets of user data included in each user data pair. The images for each user data pair can be compared, and the similarity between the users corresponding to the two sets of user data in the pair can be calculated from the comparison. The image comparison may use processing such as image preprocessing, image feature extraction and image feature comparison, which is not limited in the embodiments of the present application.

In step S106, sample data for training a preset classification model is determined according to the plurality of user data pairs and the user similarity corresponding to each user data pair.

The classification model may be any classification model, such as a naive Bayesian classification model, a logistic regression classification model, a decision tree classification model or a support vector machine classification model. In the embodiments of the present application, considering that the classification model is used only to determine whether two different users are similar, it may be a binary classification model. The sample data is the data used to train the classification model. It may be the two sets of user data in a user data pair, or data obtained after the user data has been processed in some way. For example, feature extraction may be performed on the user data to obtain corresponding user features, and the user feature data may be used as the sample data.

In implementations, a similarity threshold, for example 80% or 70%, may be set in advance. The user similarity corresponding to each user data pair can then be compared with the similarity threshold. User data pairs whose user similarity is greater than the threshold can be grouped into one set, and user data pairs whose user similarity is less than the threshold can be grouped into another set. A predetermined number (for example, 40000 or 50000) of user data pairs can be selected from each of the two sets, and the selected user data pairs are used as the sample data for training the preset classification model.
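The threshold-based selection above can be sketched as follows. The function name `select_sample_pairs`, the 0-to-1 similarity scale, the 1/0 labels and the use of random sampling are assumptions made for illustration; the text only says that a predetermined number of pairs is "selected" from each set, and it does not say how pairs whose similarity equals the threshold are handled (this sketch leaves them out).

```python
import random


def select_sample_pairs(pairs_with_similarity, threshold=0.8, per_set=40000, seed=0):
    """Split pairs by a similarity threshold and draw samples from each set."""
    # Set 1: similarity above the threshold; set 2: similarity below it.
    similar = [pair for pair, sim in pairs_with_similarity if sim > threshold]
    dissimilar = [pair for pair, sim in pairs_with_similarity if sim < threshold]

    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    positives = rng.sample(similar, min(per_set, len(similar)))
    negatives = rng.sample(dissimilar, min(per_set, len(dissimilar)))

    # Label 1 marks a "similar" pair, label 0 a "not similar" pair.
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]


# Made-up pairs and similarities for illustration only.
data = [
    (("user1", "user2"), 0.95),
    (("user3", "user4"), 0.30),
    (("user5", "user6"), 0.85),
]
samples = select_sample_pairs(data, threshold=0.8, per_set=1)
```

Because the labels come directly from the stored user similarity, no manual labeling step is needed to produce the training samples.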

It should be noted that the sample data for training the preset classification model can also be selected in ways other than the above. For example, features of the two sets of user data included in each user data pair may be extracted to obtain corresponding user features, and the user features may then be grouped into the two sets above according to the user similarity corresponding to each user data pair and the similarity threshold. The user feature data in the two sets can be used as the sample data for training the preset classification model.

In step S108, the classification model is trained based on the sample data to obtain a similarity classification model.

The similarity classification model may be a model used to determine the degree of similarity between different users.

In implementations, where the user data pairs selected above are used as the sample data for training the preset classification model, feature extraction may be performed on the two sets of user data in each selected user data pair to obtain corresponding user features, and the user features of each user data pair in the sample data may then be input to the classification model for calculation. After the calculation, a result is output. The result can be compared with the user similarity corresponding to that user data pair to determine whether the two are the same. If they are not the same, the relevant parameters of the classification model can be changed, the user features of the user data pair are input to the modified classification model for calculation, and it is again determined whether the result is the same as the user similarity. This procedure is repeated until the two are the same. Once they are the same, the above processing can be performed on the next selected user data pair. Finally, if the result obtained after the user features of every user data pair are input to the classification model is the same as the user similarity corresponding to that pair, the resulting classification model is the similarity classification model.
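The compare-and-adjust loop above can be illustrated with a perceptron-style update. This is only a sketch under stated assumptions: the text does not say which classification model or parameter update rule is used, so the linear model, the function names `predict` and `train_similarity_classifier`, the learning rate, and the toy feature vectors are all hypothetical. Unlike the description, which repeats on one pair until it matches before moving on, this sketch passes over all pairs repeatedly until every output matches its label.

```python
def predict(weights, bias, features):
    """Binary output of a linear model: 1 = similar, 0 = not similar."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_similarity_classifier(samples, learning_rate=0.1, max_epochs=100):
    """Adjust parameters whenever the output differs from the stored label."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error != 0:
                # Output and label differ: change the model parameters.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            # Every pair's output now equals its label; training is done.
            break
    return weights, bias


# Toy pair-level feature vectors (hypothetical) with labels 1/0.
samples = [
    ([0.9, 0.8], 1),
    ([0.1, 0.2], 0),
    ([0.8, 0.9], 1),
    ([0.2, 0.1], 0),
]
weights, bias = train_similarity_classifier(samples)
```

The stopping condition mirrors the final sentence of the paragraph: training ends once the model output agrees with the user similarity label for every sample pair, which is only guaranteed here when the samples are linearly separable.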

With the above method, the similarity classification model can be obtained. For the use of the similarity classification model, refer to the following:

As shown in FIG. 2, an embodiment of the present application provides a similarity determination method. The method may be executed by a terminal device or a server. The terminal device may be a personal computer or the like. The server may be a single independent server or a server cluster formed by a plurality of servers. The method may specifically include the following steps:

In step S202, a detection target user data pair is obtained.

The detection target user data pair may be a user data pair formed from the user data of the two users to be detected.

In implementations, a corresponding detection application can be set up to detect the similarity between two different users. As shown in FIG. 3, the detection application may include a button for uploading data. When the similarity between two different users needs to be detected, the upload button can be pressed, and the detection application can pop up a prompt box for uploading data. The data uploader can enter the data of the detection target user data pair in the prompt box and tap the confirmation button in the prompt box when the input is complete. The detection application can then obtain the detection target user data pair entered by the data uploader. The detection application may be installed on a terminal device or on a server. If the similarity determination method provided by the embodiments of the present application is executed by a server and the detection application is installed on a terminal device, the detection application can send the detection target user data pair to the server after obtaining it, so that the server obtains the detection target user data pair. If the detection application is installed on the server, the server can obtain the detection target user data pair directly from the detection application.

In step S204, feature extraction is performed on each set of detection target user data in the detection target user data pair to obtain detection target user features.

A detection target user feature may be a feature of the user data of a user to be detected.

In implementations, each set of detection target user data in the detection target user data pair can be obtained. For any set of detection target user data, corresponding features can be extracted from it by using a preset feature extraction algorithm, and the extracted features can be used as the detection target user feature corresponding to that set of detection target user data. In this way, the detection target user feature corresponding to each set of detection target user data in the detection target user data pair can be obtained.

It should be noted that the feature extraction algorithm may be any algorithm capable of extracting predetermined features from user data, and may be set according to the actual situation.

In step S206, the similarity between the users corresponding to the two sets of detection target user data in the detection target user data pair is determined according to the detection target user features and the pre-trained similarity classification model.

In implementations, the detection target user features obtained in step S204 can be input to the similarity classification model obtained through steps S102 to S108 for calculation. The result output by the similarity classification model may be the similarity between the users corresponding to the two sets of detection target user data in the detection target user data pair.

It should be noted that, in practical applications, the direct output of the similarity classification model may be presented as a percentage, for example 90% or 40%. To make the output more intuitive for the user, the direct output may be further processed according to the actual situation, for example when identical twins need to be distinguished from non-identical twins, or when identical twins need to be distinguished from fraternal twins. For such cases, a classification threshold may be set. If the direct output is greater than the classification threshold, the users corresponding to the two sets of detection target user data in the detection target user data pair are determined to be identical twins; otherwise, they are determined to be non-identical twins or fraternal twins. In this way, the similarity between the users corresponding to the two sets of detection target user data in the pair can be determined quickly according to the pre-trained similarity classification model, which improves the efficiency of determining the similarity between users.
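The mapping from the model's direct output to a readable label can be sketched as follows. The function name `label_pair`, the 0-to-1 output scale and the threshold value 0.8 are hypothetical; the text does not give a value for the classification threshold.

```python
def label_pair(model_output, classification_threshold=0.8):
    """Turn the model's direct output (a similarity in [0, 1]) into a label."""
    if model_output > classification_threshold:
        return "identical twins"
    # Everything at or below the threshold falls into the other class.
    return "not identical twins"


high = label_pair(0.90)  # e.g. a direct output of 90%
low = label_pair(0.40)   # e.g. a direct output of 40%
```

With this assumed threshold, a direct output of 90% is reported as identical twins and a direct output of 40% is not.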

It should be noted that, in the foregoing description, the user data pair and the detection target user data pair each include two sets of user data. In practical applications, the model training method and the similarity determination method provided by the present application can also be applied to user data combinations and detection target user data combinations that include more than two sets of user data. For the specific implementation, refer to the related content in the embodiments of the present application; details are not repeated here.

The embodiments of the present application provide a model training method and a similarity determination method in which a plurality of user data pairs are obtained, where the data fields of the two sets of user data in each user data pair have the same part; the user similarity corresponding to each user data pair is obtained; sample data for training a preset classification model is determined; and the classification model is then trained based on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of detection target user data in a detection target user data pair can subsequently be determined according to the similarity classification model. In this way, the plurality of user data pairs are obtained using only the shared data fields, and the association between the users corresponding to the two sets of user data in each pair is determined from the user similarity in order to obtain the sample data for training the preset classification model. Because the sample data is obtained without manual labeling, the model can be trained quickly, model training efficiency can be improved, and resource consumption can be reduced.

Embodiment 2

As shown in FIG. 4, an embodiment of the present application provides a data similarity determination method. The method may be executed by a server, or jointly by a terminal device and a server. The terminal device may be a personal computer or the like. The server may be a single independent server or a server cluster formed by a plurality of servers. In the embodiments of the present application, to improve model training efficiency, the detailed description uses an example in which the method is executed by a server. For the case in which the method is implemented jointly by a terminal device and a server, refer to the related content below; details are not repeated here. The method specifically includes the following:

At present, as a new method of user identity verification, face recognition brings convenience to users but also creates new risks. With current face recognition technology, an image of the user is captured on site and compared with the user image stored in advance in the database of the face recognition system. If the value obtained from the comparison reaches a predetermined threshold, the user is determined to be the user corresponding to the stored image, and the user's identity is thereby verified. However, this approach has difficulty verifying the identity of users whose faces are very similar, and the resulting verification failures are likely to lead to incorrect account registration and misappropriation of account funds.

Twins, and identical twins in particular, are the most typical known case of highly similar appearance. They are closely related to each other and are very likely to give rise to negative public opinion. If a list containing as many twin users as possible were available, a dedicated face recognition strategy could be designed for these users to guard against the above risks. A model for effectively identifying twins can therefore be built to output, with high accuracy, a list of twins whose face recognition behavior is to be monitored, thereby achieving risk control. For an implementation that builds such a model, refer to the model training method provided by steps S402 to S412 below. The details are as follows:

In step S402, a plurality of user data pairs are obtained, where the data fields of the two sets of user data in each user data pair have the same part.

In implementations, considering that twins generally have the same surname and the same first 14 digits of the identity card number, the surname and the first 14 digits of the identity card number can be used as the data fields for selecting user data pairs. For the specific implementation of step S402, refer to the related content of step S102 in Embodiment 1; details are not repeated here.

It should be noted that the above selection of user data pairs is based on the surname and the first 14 digits of the identity card number. In other embodiments of the present application, the selection of user data pairs may also be based on other information, for example the surname and the social security card number, or the first 14 digits of the identity card number and the social security card number, which is not limited in the embodiments of the present application.

Considering that the degree of similarity between the users corresponding to the two sets of user data in a user data pair needs to be determined during model training, a related processing approach is provided below; see steps S404 and S406.

In step S404, the biological features of the users corresponding to a first user data pair are obtained, where the first user data pair is any one of the plurality of user data pairs.

Biological features may be physiological and behavioral characteristics of the human body: physiological characteristics such as fingerprint features, iris features, facial features or DNA, and behavioral characteristics such as voiceprint features, handwriting features or keystroke habits.

In implementations, after the plurality of user data pairs are obtained in step S402, one user data pair (that is, the first user data pair) can be selected arbitrarily from them. When a user uses a terminal device to log in to the server for registration, the user may upload one or more of the user's biological features described above to the server. The server can store the biological features in association with the user's identifier. The user's identifier may be the surname or name entered by the user during registration. The information stored in association on the server may be as shown in Table 1.

Table 1

User's identifier | Biological feature
User 1 | Biological feature A
User 2 | Biological feature B
User 3 | Biological feature C

After selecting the first user data pair, the server can extract the user identifiers included in the first user data pair and then obtain the corresponding biological features according to those identifiers, thereby obtaining the biological features of the users corresponding to the first user data pair. For example, if the user identifiers included in the first user data pair are user 2 and user 3, querying the correspondence in the table above shows that user 2 corresponds to biological feature B and user 3 corresponds to biological feature C. That is, the biological features of the users corresponding to the first user data pair are biological feature B and biological feature C.

In step S406, the user similarity corresponding to the first user data pair is determined according to the biological features of the users corresponding to the first user data pair.

In implementations, after the biological features of the users corresponding to the first user data pair are obtained in step S404, a similarity calculation is performed on the obtained biological features to determine the degree of similarity between the two corresponding users (that is, the user similarity). The similarity calculation can be implemented in various ways, for example based on the Euclidean distance between feature vectors, which is not limited in the embodiments of the present application.

It should be noted that a threshold can be set for determining whether users are similar. For example, the threshold is set to 70. When the user similarity corresponding to the two biological features is 70 or more, the users corresponding to the two sets of user data in the first user data pair are determined to be similar; when it is less than 70, they are determined not to be similar.

In the same way, the above processing can be performed on the user data pairs other than the first user data pair, so that the user similarity corresponding to each of the plurality of user data pairs is obtained.

In steps S404 and S406 above, the user similarity is determined according to the users' biological features. In practical applications, the user similarity can be determined in various specific ways. Steps S404 and S406 are described in detail below using an example in which the biological features are facial features; see step 1 and step 2 below.

In step 1, the face images of the users corresponding to the first user data pair are obtained, where the first user data pair is any one of the plurality of user data pairs.

In implementations, after the plurality of user data pairs are obtained in step S402, one user data pair (that is, the first user data pair) can be selected arbitrarily from them. When a user uses a terminal device to log in to the server for registration, the user may upload an image containing the user's face to the server. The server can store the image in association with the user's identifier. The user's identifier may be the surname or name entered by the user during registration. The information stored in association on the server may be as shown in Table 2.

Table 2

User's identifier | Image containing the user's face
User 1 | Image A
User 2 | Image B
User 3 | Image C

After obtaining the first user data pair, the server can extract the user identifiers included in the first user data pair and then obtain the corresponding images according to those identifiers, thereby obtaining the face images of the users corresponding to the first user data pair. For example, if the user identifiers included in the first user data pair are user 2 and user 3, querying the correspondence in the table above shows that the image containing the face of user 2 is image B and the image containing the face of user 3 is image C. That is, the face images of the users corresponding to the first user data pair are image B and image C.

In step 2, feature extraction is performed on the face images to obtain face image features, and the user similarity corresponding to the first user data pair is determined according to the face image features of the users corresponding to the first user data pair.

In an implementation, after the face images of the users corresponding to the first user data pair are obtained through step 1, feature extraction may be performed on each obtained face image to obtain the corresponding face image features, and a feature vector is constructed from the extracted features of each face image. The Euclidean distance between the feature vectors of any two face images can then be calculated, and the degree of similarity between the two corresponding users (that is, the user similarity) can be determined from the value of that distance. The larger the Euclidean distance between the feature vectors, the lower the user similarity; the smaller the Euclidean distance, the higher the user similarity.

It should be noted that two face images may be either similar or dissimilar. A threshold can therefore be set for determining whether the images are similar. For example, the threshold is set to 70. When the user similarity corresponding to the two face images is 70 or more, the users corresponding to the two sets of user data in the first user data pair are determined to be similar; when the user similarity corresponding to the two face images is less than 70, the users corresponding to the two sets of user data in the first user data pair are determined to be dissimilar.

For example, continuing the example of step 1, feature extraction is performed on Image B and on Image C, and a feature vector is constructed from the extracted features of each, yielding the feature vector of Image B and the feature vector of Image C. The Euclidean distance between the feature vector of Image B and the feature vector of Image C is calculated, and the user similarity between User 2 and User 3 is determined from the value of the obtained Euclidean distance.
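The distance-based comparison described above can be sketched as follows. This is a minimal illustration, not the method of the application: the text only states that a larger Euclidean distance means a lower user similarity, so the mapping `100 / (1 + distance)` onto a 0 to 100 scale, and the function names, are assumptions made here so that the example threshold of 70 can be applied.

```python
import math


def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(vec_a) != len(vec_b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def similarity_score(distance):
    """Assumed mapping: 100 at distance 0, decreasing as the distance grows."""
    return 100.0 / (1.0 + distance)


def users_similar(vec_a, vec_b, threshold=70.0):
    """True when the similarity score reaches the threshold (70 in the text)."""
    return similarity_score(euclidean_distance(vec_a, vec_b)) >= threshold
```

In practice the feature vectors would come from a face feature extractor, which the text does not specify; any monotonically decreasing mapping from distance to similarity would preserve the stated relationship.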

In the same way, the above processing may be performed on the user data pairs other than the first user data pair in the plurality of user data pairs, thereby obtaining the user similarity corresponding to each user data pair in the plurality of user data pairs.

In addition, a further optional processing manner is provided below for the processing of steps S404 and S406 above; for details, refer to the following step 1 and step 2.

In step 1, voice data of the users corresponding to a first user data pair is acquired, where the first user data pair is any user data pair among the plurality of user data pairs.

In an implementation, after the plurality of user data pairs is acquired through step S402 above, one user data pair (that is, the first user data pair) may be arbitrarily selected from the plurality of user data pairs. When a user logs in to the server through a terminal device to register, the user may upload to the server voice data that has a predetermined duration (for example, 3 seconds or 5 seconds) and/or contains predetermined voice content (for example, one or more spoken words or one spoken sentence). The server may store the voice data in association with the user's identifier. After selecting the first user data pair, the server may extract the user identifiers included in the first user data pair and then acquire the corresponding voice data according to those identifiers, thereby obtaining the voice data of the users corresponding to the first user data pair.

In step 2, feature extraction is performed on the voice data to obtain voice features, and the user similarity corresponding to the first user data pair is determined according to the voice features of the users corresponding to the first user data pair.

In an implementation, after the voice data of the users corresponding to the first user data pair is obtained through step 1 above, feature extraction is performed on each piece of the obtained voice data, and the degree of similarity between the two corresponding users (that is, the user similarity) can be determined based on the features extracted from each piece of voice data. For a specific implementation, refer to the related content of step S406 above. Alternatively, the user similarity may be determined through a one-to-one comparison of the features, or voice spectrum analysis may be performed on any two pieces of voice data to determine the user similarity. In the same way, the above processing may be performed on the user data pairs other than the first user data pair in the plurality of user data pairs, thereby obtaining the user similarity corresponding to each user data pair in the plurality of user data pairs.

In step S408, feature extraction is performed on each user data pair in the plurality of user data pairs to obtain the associated user features between the two sets of user data in each user data pair.

In an implementation, a user data pair (which may be referred to as a third user data pair) may be arbitrarily selected from the plurality of user data pairs, and feature extraction may be performed separately on the two different sets of user data in the third user data pair. For example, the third user data pair includes user data 1 and user data 2, and feature extraction may be performed on user data 1 and on user data 2. The associated user features between the two sets of user data in the third user data pair can then be obtained by comparing the features extracted from the different user data. In the same way, the above processing may be performed on the user data pairs other than the third user data pair in the plurality of user data pairs, thereby obtaining the associated user features between the two sets of user data in each user data pair.

In practical applications, the user features may include, but are not limited to, household registration dimension features, name dimension features, social features, interest features, and the like. The household registration dimension features may include features of user identity information and are based mainly on China's household registration management system. The identity card information included in the household registration contains the date of birth and the place of household registration, and the household registration contains the parents' names and the resident's address. However, for historical and other reasons, some citizens' registration information does not match their actual situation. For example, the registered date of birth may be earlier than the actual date, two children may each take the surname of a different parent, or the parents' divorce may lead to separate household registrations. The household registration dimension features can therefore serve as a reference for determining whether two users are twins. In this way, the association between different users is determined according to features such as whether the different users corresponding to the user data pair have the same date of birth, the same place of household registration, the same parents, or the same current address.

The name dimension features include features of the user's name information and features of the degree of rarity of the user's surname. For the name dimension features, based on Natural Language Processing (NLP) theory and social experience, two people are generally considered to be related if their names look alike, such as Zhang Jinlong and Zhang Jinhu, or have a particular semantic relationship, such as Zhang Meimei and Zhang Lili. In the embodiments of the present application, the relationship between the names of two users may be evaluated using a dictionary, and the users' registered personal information and demographic data are used to calculate the degree of rarity of the surname as a feature. In this way, the association between different users is determined according to features such as whether the different users corresponding to the user data pair have the same surname, whether their names have the same length, the degree to which their names are synonymous, whether the combination of their names forms a phrase, the degree of rarity of the surname, and the like.

The social features include features of the user's social relationship information. The social features may be obtained by extracting the social relationships of the user data pair based on big data. In general, twins interact with each other frequently and have heavily overlapping social relationships, such as the same relatives or even the same classmates. In the embodiments of the present application, the user data pair is associated based on the users' personal information stored on the server and on the relationship network formed by existing data, address books, and the like, to obtain the corresponding features. In this way, the association between the different users corresponding to the user data pair is determined according to features such as whether the different users follow each other in a social networking application, whether they have transferred funds to each other, whether they have saved each other's contact information in their address books, whether they have labeled each other with a specific form of address in their address books, the number of common contacts between their address books, and the like.

In addition, considering that twins share many hobbies and shopping preferences and may travel together, the user features may further include features of e-commerce, tourism, entertainment, and other dimensions. In the embodiments of the present application, data related to the features of e-commerce, tourism, entertainment, and other dimensions may be acquired from a predetermined database or website. In this way, the association between the different users corresponding to the user data pair is determined according to features such as the number of shopping records the different users have in common, whether they have traveled together, whether they have checked in to a hotel at the same time, the similarity between their shopping preferences, whether they have the same delivery address, and the like.
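The pairwise feature extraction of step S408 can be illustrated with a short sketch that compares two user records field by field. The record layout (dictionary keys such as `birth_date`, `surname`, `contacts`) and the particular subset of features are hypothetical, chosen only to mirror a few of the household registration, name, social, and e-commerce features listed above.

```python
def common_count(items_a, items_b):
    """Number of distinct items shared by two collections."""
    return len(set(items_a) & set(items_b))


def pair_features(user_1, user_2):
    """Associated user features for one user data pair (hypothetical fields)."""
    return {
        # Household registration dimension
        "same_birth_date": int(user_1["birth_date"] == user_2["birth_date"]),
        "same_registration_place": int(
            user_1["registration_place"] == user_2["registration_place"]
        ),
        # Name dimension
        "same_surname": int(user_1["surname"] == user_2["surname"]),
        "same_given_name_length": int(
            len(user_1["given_name"]) == len(user_2["given_name"])
        ),
        # Social dimension
        "common_contacts": common_count(user_1["contacts"], user_2["contacts"]),
        # E-commerce dimension
        "same_delivery_address": int(
            user_1["delivery_address"] == user_2["delivery_address"]
        ),
    }
```

Features that need external resources, such as name synonymy from a dictionary or surname rarity from demographic data, are omitted here because the text does not describe how they are computed.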

It should be noted that the processing of determining the user similarity (including steps S404 and S406) and the feature extraction processing above (that is, step S408) are described as being executed in sequence. In practical applications, the processing of determining the user similarity and the feature extraction processing may also be executed simultaneously or in the reverse order, which is not limited in the embodiments of the present application.

In step S410, sample data for training the classification model is determined according to the associated user features between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair.

In an implementation, a threshold may be set in advance. According to the threshold, the user data pairs whose user similarity is greater than the threshold may be selected from the plurality of user data pairs. The associated user features between the two sets of user data in each selected user data pair may be used as the user features for training the classification model. The selected user features and the user similarities corresponding to the selected user data pairs may be determined as the sample data for training the classification model.
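The threshold-based selection just described can be sketched in a few lines. The representation of each pair as a `(features, similarity)` tuple and the function name are assumptions made for illustration.

```python
def select_above_threshold(pairs, threshold):
    """Keep the (features, similarity) entries whose similarity exceeds the threshold.

    `pairs` is an iterable of (features, similarity) tuples, one per user data pair.
    The returned list is the sample data: features together with their similarity.
    """
    return [(features, similarity) for features, similarity in pairs
            if similarity > threshold]
```

The comparison is strict because the text selects pairs whose similarity is "greater than" the threshold.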

The processing of step S410 may be implemented in various ways other than the manner described above. A further optional processing manner is provided below, which specifically includes the following step 1 and step 2:

In step 1, positive sample features and negative sample features are selected from the user features corresponding to the plurality of user data pairs, according to the user similarity corresponding to each user data pair and a predetermined similarity threshold.

In an implementation, based on the common knowledge that identical twins have very similar appearances, the same date and place of birth, and generally the same surname, the user similarity between the face images of two users is calculated in order to determine whether the two users are identical twins. Specifically, a similarity threshold of, for example, 80% or 70% may be set in advance. A user data pair whose user similarity is greater than the similarity threshold may be determined to be a user data pair of identical twins, and a user data pair whose user similarity is less than the similarity threshold may be determined to be a user data pair of non-identical twins. Meanwhile, because identical twins and fraternal twins have basically the same features except for appearance, the user features corresponding to the user data pairs of identical twins may be used as the positive sample features of the similarity classification model, and the user features corresponding to the user data pairs of non-identical twins (including fraternal twins and non-twins) may be used as the negative sample features of the similarity classification model.

It should be noted that the negative sample features are not necessarily all user features of fraternal twins. In practical applications, the user features of fraternal twins may account for only a very small portion of the negative sample features, or the negative sample features may include a small number of positive sample features. This does not affect the training of the classification model and instead improves the robustness of the similarity classification model.

In addition, the positive sample features may include the same number of features as the negative sample features. For example, from the plurality of user data pairs, 10000 user data pairs with a user similarity of less than 10% are selected, 10000 user data pairs with a user similarity greater than 10% and less than 20% are selected, 10000 user data pairs with a user similarity greater than 20% and less than 30% are selected, 10000 user data pairs with a user similarity greater than 30% and less than 40% are selected, and 10000 user data pairs with a user similarity greater than 40% and less than 50% are selected. The user features of these 50000 user data pairs are used as the negative sample features. Likewise, 40000 user data pairs with a user similarity greater than 80% and less than 90% and 10000 user data pairs with a user similarity greater than 90% and less than 100% are selected from the plurality of user data pairs. The user features of these 50000 user data pairs are used as the positive sample features.

In step 2, the positive sample features and the negative sample features are used as the sample data for training the classification model.

In an implementation, the data of the user features and the corresponding user similarities may be combined, and the combined data may be used as the sample data for training the classification model.
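Steps 1 and 2 above can be sketched as a labeling routine that splits pairs around a similarity threshold and keeps the two classes the same size. This is a simplified illustration: it draws each class by uniform random sampling rather than by the per-range quotas of the example (10000 pairs per 10% band), and the tuple layout, function name, and `seed` parameter are assumptions.

```python
import random


def build_labeled_samples(pairs, threshold, per_class, seed=0):
    """Return balanced (features, label) samples; label 1 is positive, 0 negative.

    `pairs` is a list of (features, similarity) tuples. Pairs with similarity
    above the threshold are positive candidates, those below it are negative
    candidates; pairs exactly at the threshold are left out.
    """
    positives = [features for features, sim in pairs if sim > threshold]
    negatives = [features for features, sim in pairs if sim < threshold]
    # Use the same count for both classes, limited by what is available.
    count = min(per_class, len(positives), len(negatives))
    rng = random.Random(seed)
    chosen_pos = rng.sample(positives, count)
    chosen_neg = rng.sample(negatives, count)
    return [(f, 1) for f in chosen_pos] + [(f, 0) for f in chosen_neg]
```

Reproducing the banded selection of the example would only require filtering `pairs` by similarity range before sampling from each band.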

In step S412, the classification model is trained based on the sample data to obtain the similarity classification model.

Because the main purpose of the classification model is to identify twins, the similarity classification model may be a binary classifier model. Specifically, to keep the embodiments of the present application simple and feasible, it may be a Gradient Boosting Decision Tree (GBDT) binary classifier model.

In an implementation, the positive sample features may each be input into the classification model for calculation, and the obtained calculation result may be compared with the user similarity corresponding to the positive sample feature. If the two match, the next positive sample feature or negative sample feature may be selected and input into the classification model for calculation, and the obtained calculation result is again compared with the user similarity corresponding to the selected sample feature. If the two do not match, the relevant parameters of the classification model may be adjusted, the sample feature is then input into the adjusted classification model for calculation, and the obtained calculation result is compared again with the corresponding user similarity. This procedure is repeated until the two match. In this way, all positive sample features and all negative sample features can be input into the classification model for calculation, thereby training the classification model. The final classification model obtained through training may be used as the similarity classification model.
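The compare-and-adjust procedure described above can be illustrated with a deliberately simple stand-in. The sketch below uses a perceptron-style linear classifier rather than the GBDT model named in the text, because it makes the loop of "calculate, compare with the expected label, adjust parameters on a mismatch, repeat" explicit; it is not how gradient boosted trees are fitted, and the function names, the 0/1 labels, and the learning rate are assumptions.

```python
def predict(weights, bias, features):
    """Binary output of a linear classifier: 1 (positive) or 0 (negative)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(samples, epochs=100, learning_rate=1.0):
    """Perceptron-style training on (features, label) samples with labels 0/1.

    Each sample is fed to the model; when the output does not match the label,
    the parameters are adjusted. Training stops early once a full pass over
    the samples produces no mismatch.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mismatches = 0
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error != 0:
                mismatches += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mismatches == 0:
            break
    return weights, bias
```

A production system following the text would replace this stand-in with a GBDT binary classifier from an established library and train it on the positive and negative sample features.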

The similarity classification model is thus obtained through the above processing. The similarity classification model may be applied to face recognition scenarios. For twin users who may be involved in risk-related behavior, the similarity classification model may be used for separate risk control.

As shown in FIG. 5, after the similarity classification model is obtained, it may be used to determine whether the to-be-detected users corresponding to a to-be-detected user data pair are twins. For a specific implementation, refer to the related content of the following steps S414 to S420.

In step S414, a to-be-detected user data pair is acquired.

The content of step S414 is the same as that of step S202 in Embodiment 1. For a specific implementation of step S414, refer to the related content of step S202; details are not repeated here.

In step S416, feature extraction is performed on each set of to-be-detected user data in the to-be-detected user data pair to obtain the to-be-detected user features.

For the process in step S416 of performing feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain the to-be-detected user features, refer to the related content of step S408 above. That is, the features extracted from the to-be-detected user data may include, but are not limited to, household registration dimension features, name dimension features, social features, interest features, and the like; details are not repeated here.

In step S418, the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair is determined according to the to-be-detected user features and the pre-trained similarity classification model.

The content of step S418 is the same as that of step S206 in Embodiment 1. For a specific implementation of step S418, refer to the related content of step S206; details are not repeated here.

In step S420, if the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair is greater than a predetermined similarity classification threshold, the to-be-detected users corresponding to the to-be-detected user data pair are determined to be twins.

In an implementation, because the output twin list affects the target users' use of face recognition, the similarity classification model is preferably used in a way that yields high accuracy. In practical applications, the similarity classification threshold may be set to a large value, for example, 95% or 97%. The to-be-detected user features are predicted and scored using the trained similarity classification model. The scoring process calculates the probability that the users corresponding to the user data pair are twins. For example, if the probability is 80%, the score is 80; if the probability is 90%, the score is 90. The higher the score, the higher the probability that the users corresponding to the user data pair are twins.
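The scoring and decision of steps S418 and S420 can be sketched as follows, assuming the trained model outputs a probability in the range 0 to 1. The function names and the default threshold of 95 (one of the two example values in the text) are assumptions for illustration.

```python
def twin_score(probability):
    """Convert a twin probability in [0, 1] to a score on a 0 to 100 scale."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * 100.0


def is_twin_pair(probability, threshold=95.0):
    """True when the score is strictly greater than the classification threshold."""
    return twin_score(probability) > threshold
```

A high threshold trades recall for precision, which matches the stated preference for a high-accuracy twin list.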

The embodiments of the present application provide a data similarity determination method in which a plurality of user data pairs is acquired, where the data fields of the two sets of user data in each user data pair have an identical part; the user similarity corresponding to each user data pair is acquired; sample data for training a preset classification model is determined; and the classification model is then trained based on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of to-be-detected user data in a to-be-detected user data pair can subsequently be determined according to the similarity classification model. In this way, the plurality of user data pairs is obtained merely through identical data fields, and the association between the users corresponding to the two sets of user data in each user data pair is determined according to the user similarity to obtain the sample data for training the preset classification model. Because the sample data can be obtained without manual labeling, the model can be trained quickly, model training efficiency is improved, and resource consumption is reduced.

Embodiment 3

The data similarity determination method provided by the embodiments of the present application is described above. Based on the same concept, the embodiments of the present application further provide a model training apparatus, as shown in FIG. 6.

The model training apparatus may be deployed on a server. The apparatus includes a data acquisition module 601, a similarity acquisition module 602, a sample data determination module 603, and a model training module 604, where:

the data acquisition module 601 is configured to acquire a plurality of user data pairs, where the data fields of the two sets of user data in each user data pair have an identical part;

the similarity acquisition module 602 is configured to acquire the user similarity corresponding to each user data pair, where the user similarity is the similarity between the users corresponding to the two sets of user data in each user data pair;

샘플 데이터 결정 모듈(603)은 각각의 사용자 데이터 쌍에 대응하는 사용자 유사성 및 복수의 사용자 데이터 쌍에 따라, 미리 설정된 분류 모델을 훈련시키기 위한 샘플 데이터를 결정하도록 구성되며;The sample data determination module 603 is configured to determine, according to a user similarity corresponding to each user data pair and a plurality of user data pairs, sample data for training a preset classification model;

모델 훈련 모듈(604)은 유사성 분류 모델을 획득하기 위해 샘플 데이터에 기초하여 분류 모델을 훈련시키도록 구성된다.Model training module 604 is configured to train the classification model based on the sample data to obtain a similarity classification model.
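As a rough illustration of what the data acquisition module 601 might do, the following Python sketch groups user records by a shared part of a data field and forms every pair within a group. The record layout, the `shared_key` function, and the choice of surname plus region as the shared part are hypothetical; the text above only requires that the data fields of the two sets of user data in a pair have an identical part.

```python
from collections import defaultdict
from itertools import combinations


def build_user_data_pairs(records, shared_key):
    """Group user records by a shared field part and pair them up.

    `shared_key` maps a record to the part of its data fields that both
    members of a pair must have in common (a hypothetical choice here).
    """
    groups = defaultdict(list)
    for record in records:
        groups[shared_key(record)].append(record)

    pairs = []
    for members in groups.values():
        # Every two records in the same group form one user data pair.
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical user data; field names are assumptions for illustration.
records = [
    {"user_id": "u1", "surname": "Kim", "region": "A"},
    {"user_id": "u2", "surname": "Kim", "region": "A"},
    {"user_id": "u3", "surname": "Lee", "region": "A"},
    {"user_id": "u4", "surname": "Kim", "region": "B"},
]
pairs = build_user_data_pairs(records, lambda r: (r["surname"], r["region"]))
pair_ids = [(a["user_id"], b["user_id"]) for a, b in pairs]
print(pair_ids)
```

In this toy data only `u1` and `u2` share both surname and region, so a single pair is produced.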

In an embodiment of the present application, the similarity acquisition module 602,

In an embodiment of the present application, the biological feature includes a facial image feature;

In an embodiment of the present application, the biological feature includes a voice feature;

In an embodiment of the present application, the sample data determination module 603 includes:

a sample data determination unit configured to determine, according to the associated user features between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair, the sample data for training the classification model.

In an embodiment of the present application, the sample data determination unit is configured to select positive sample features and negative sample features from the user features corresponding to the plurality of user data pairs according to the user similarity corresponding to each user data pair and a predetermined similarity threshold, and to use the positive sample features and the negative sample features as the sample data for training the classification model.
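The threshold-based selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_samples`, the dictionary layout, the threshold value of 0.8, and the use of a strict "greater than" comparison are all hypothetical, since the text does not fix how the threshold is applied.

```python
def select_samples(pair_features, pair_similarity, threshold):
    """Split pair features into positive and negative samples.

    Pairs whose user similarity exceeds `threshold` are treated as
    positive samples, all others as negative (an assumed convention).
    """
    positives, negatives = [], []
    for pair_id, features in pair_features.items():
        if pair_similarity[pair_id] > threshold:
            positives.append(features)
        else:
            negatives.append(features)
    return positives, negatives


# Hypothetical feature vectors and user similarities for three pairs.
pair_features = {"p1": [1.0, 0.9], "p2": [0.0, 0.2], "p3": [1.0, 0.7]}
pair_similarity = {"p1": 0.95, "p2": 0.10, "p3": 0.85}

positives, negatives = select_samples(pair_features, pair_similarity, threshold=0.8)
print(len(positives), len(negatives))
```

No manual labeling is involved: the label of each sample follows directly from the user similarity of its pair.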

In an embodiment of the present application, the user features include a household registration dimension feature, a name dimension feature, a social feature, and an interest feature, where the household registration dimension feature includes a feature of user identity information, the name dimension feature includes a feature of user name information and a feature of the degree of scarcity of the user's surname, and the social feature includes a feature of the user's social relationship information.

In an embodiment of the present application, the positive sample features include the same quantity of features as the negative sample features.
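One simple way to end up with equal quantities of positive and negative sample features is to down-sample the larger set. The sketch below assumes random down-sampling with a fixed seed; the text only states that the quantities are the same, so the function `balance_samples` and the sampling strategy are illustrative assumptions.

```python
import random


def balance_samples(positives, negatives, seed=0):
    """Down-sample both sets to the size of the smaller one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    size = min(len(positives), len(negatives))
    return rng.sample(positives, size), rng.sample(negatives, size)


# Hypothetical sample features: two positives, five negatives.
positives = [[1.0, 0.9], [1.0, 0.7]]
negatives = [[0.0, 0.2], [0.1, 0.1], [0.0, 0.4], [0.2, 0.3], [0.1, 0.0]]

balanced_pos, balanced_neg = balance_samples(positives, negatives)
print(len(balanced_pos), len(balanced_neg))
```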

In an embodiment of the present application, the similarity classification model is a binary classifier model.
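To make the idea of a binary classifier over pair features concrete, here is a minimal sketch that trains a logistic regression with plain gradient descent. Logistic regression is a stand-in chosen for brevity, not a model named in this passage, and the feature vectors, learning rate, and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_binary_classifier(features, labels, epochs=500, lr=0.5):
    """Fit logistic-regression weights with batch gradient descent."""
    n_features = len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def predict_similarity(model, x):
    """Return the predicted probability that the pair is a positive sample."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical positive (label 1) and negative (label 0) sample features.
features = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
labels = [1, 1, 1, 0, 0, 0]

model = train_binary_classifier(features, labels)
p_high = predict_similarity(model, [0.85, 0.85])
p_low = predict_similarity(model, [0.15, 0.15])
print(p_high > p_low)
```

The classifier outputs a probability, which can then be compared against a similarity threshold when a decision is needed.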

Embodiments of the present application provide a model training apparatus in which a plurality of user data pairs are acquired, where the data fields of the two sets of user data in each user data pair have an identical part; a user similarity corresponding to each user data pair is acquired; sample data for training a preset classification model is determined; and the classification model is then trained on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of to-be-detected user data in a to-be-detected user data pair can be determined according to the similarity classification model. In this way, the plurality of user data pairs are obtained through identical data fields alone, and the association between the users corresponding to the two sets of user data in each user data pair is determined from the user similarity in order to obtain the sample data for training the preset classification model. Because the sample data can thus be obtained without manual labeling, the model can be trained quickly, model training efficiency is improved, and resource consumption is reduced.

Embodiment 4

The model training apparatus provided by the embodiments of the present application is described above. Based on the same concept, an embodiment of the present application further provides a data similarity determination apparatus, as shown in FIG. 7.

The data similarity determination apparatus includes a to-be-detected data acquisition module 701, a feature extraction module 702, and a similarity determination module 703, where:

the to-be-detected data acquisition module 701 is configured to acquire a to-be-detected user data pair;

the feature extraction module 702 is configured to perform feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and

the similarity determination module 703 is configured to determine, according to the to-be-detected user features and a pre-trained similarity classification model, the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair.
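The division of work among modules 701 to 703 can be sketched as three small Python functions wired together. Everything concrete here is an assumption made for illustration: the in-memory `user_store`, the field names, the way per-user features are compared, and `stand_in_model`, which merely substitutes for a pre-trained similarity classification model.

```python
def acquire_pair(user_store, id_a, id_b):
    """Module 701: acquire the to-be-detected user data pair."""
    return user_store[id_a], user_store[id_b]


def extract_user_features(user_data):
    """Module 702: extract features from one set of user data."""
    return {
        "surname": user_data["surname"].strip().lower(),
        "birth_date": user_data["birth_date"],
    }


def determine_similarity(features_a, features_b, model):
    """Module 703: score the pair with a similarity classification model."""
    # Compare the two users field by field to build the model input.
    vector = [float(features_a[key] == features_b[key]) for key in sorted(features_a)]
    return model(vector)


def stand_in_model(vector):
    """Placeholder for a pre-trained model: fraction of matching fields."""
    return sum(vector) / len(vector)


# Hypothetical stored user data.
user_store = {
    "u1": {"surname": "Kim", "birth_date": "2000-01-01"},
    "u2": {"surname": "Kim ", "birth_date": "2000-01-01"},
    "u3": {"surname": "Kim", "birth_date": "1995-05-05"},
}

data_a, data_b = acquire_pair(user_store, "u1", "u2")
similarity_12 = determine_similarity(
    extract_user_features(data_a), extract_user_features(data_b), stand_in_model
)
data_a, data_c = acquire_pair(user_store, "u1", "u3")
similarity_13 = determine_similarity(
    extract_user_features(data_a), extract_user_features(data_c), stand_in_model
)
print(similarity_12, similarity_13)
```

Keeping the model as a plain callable means the same wiring works once the placeholder is replaced by an actual trained classifier.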

In an embodiment of the present application, the apparatus,

Embodiments of the present application provide a data similarity determination apparatus in which a plurality of user data pairs are acquired, where the data fields of the two sets of user data in each user data pair have an identical part; a user similarity corresponding to each user data pair is acquired; sample data for training a preset classification model is determined; and the classification model is then trained on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of to-be-detected user data in a to-be-detected user data pair can be determined according to the similarity classification model. In this way, the plurality of user data pairs are obtained through identical data fields alone, and the association between the users corresponding to the two sets of user data in each user data pair is determined from the user similarity in order to obtain the sample data for training the preset classification model. Because the sample data can thus be obtained without manual labeling, the model can be trained quickly, model training efficiency is improved, and resource consumption is reduced.

Embodiment 5

Based on the same concept, an embodiment of the present application further provides a model training device, as shown in FIG. 8.

The model training device may be the server or the like provided in the foregoing embodiments.

The model training device may vary considerably depending on its configuration or performance, and may include one or more processors 801 and a memory 802. The memory 802 may store one or more applications or data, and may be transient or persistent storage. An application stored in the memory 802 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the model training device. Further, the processor 801 may be configured to communicate with the memory 802 and to execute, on the model training device, the series of computer-executable instructions in the memory 802. The model training device may further include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input/output interfaces 805, and one or more keyboards 806.

Specifically, in this embodiment, the model training device includes a memory and one or more programs. The one or more programs are stored in the memory and may include one or more modules, each of which may include a series of computer-executable instructions for the model training device. The one or more processors are configured to execute the one or more programs, which include computer-executable instructions for:

acquiring a plurality of user data pairs, where the data fields of the two sets of user data in each user data pair have an identical part;

acquiring a user similarity corresponding to each user data pair, where the user similarity is the similarity between the users corresponding to the two sets of user data in each user data pair;

determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and

training the classification model based on the sample data to obtain a similarity classification model.

Optionally, when executed, the executable instructions may further cause the processor to:

acquire biological features of the users corresponding to a first user data pair, where the first user data pair is any user data pair in the plurality of user data pairs; and

determine the user similarity corresponding to the first user data pair according to the biological features of the users corresponding to the first user data pair.
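As one possible way to turn two users' biological features into a user similarity, the sketch below computes the cosine similarity of two feature vectors. Cosine similarity is an assumed measure chosen for illustration, and the three-element vectors are hypothetical stand-ins for whatever an upstream feature extractor would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0  # treat an all-zero vector as carrying no similarity
    return dot / norm


# Hypothetical biological feature vectors for the two users of a pair.
features_user_a = [0.12, 0.80, 0.35]
features_user_b = [0.10, 0.78, 0.40]

user_similarity = cosine_similarity(features_user_a, features_user_b)
print(round(user_similarity, 3))
```

The resulting value would serve as the user similarity of the first user data pair.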

Optionally, when executed, the executable instructions may further cause the processor to operate as follows:

the biological features include a facial image feature;

acquiring the biological features of the users corresponding to the first user data pair includes:

acquiring facial images of the users corresponding to the first user data pair; and

performing feature extraction on the facial images to obtain facial image features;

correspondingly, determining the user similarity corresponding to the first user data pair according to the biological features of the users corresponding to the first user data pair includes:

determining the user similarity corresponding to the first user data pair according to the facial image features of the users corresponding to the first user data pair.

The biological features include a voice feature;

acquiring the biological features of the users corresponding to the first user data pair includes acquiring voice data of the users corresponding to the first user data pair; and

performing feature extraction on the voice data to obtain voice features;

correspondingly, determining the user similarity corresponding to the first user data pair according to the biological features of the users corresponding to the first user data pair includes:

determining the user similarity corresponding to the first user data pair according to the voice features of the users corresponding to the first user data pair.

Optionally, when executed, the executable instructions may further cause the processor to:

perform feature extraction on each user data pair in the plurality of user data pairs to obtain the associated user features between the two sets of user data in each user data pair; and

determine, according to the associated user features between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair, the sample data for training the classification model.

Optionally, when executed, the executable instructions may further cause the processor to select positive sample features and negative sample features from the user features corresponding to the plurality of user data pairs according to the user similarity corresponding to each user data pair and a predetermined similarity threshold; and

use the positive sample features and the negative sample features as the sample data for training the classification model.

Optionally, the user features include a household registration dimension feature, a name dimension feature, a social feature, and an interest feature, where the household registration dimension feature includes a feature of user identity information, the name dimension feature includes a feature of user name information and a feature of the degree of scarcity of the user's surname, and the social feature includes a feature of the user's social relationship information.
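The four feature families listed above could be turned into associated features of a pair along the following lines. This is a sketch under assumptions: the field names, the use of Jaccard overlap for social and interest features, and the definition of surname scarcity as one minus a surname's population share are hypothetical choices, not details given in the text.

```python
def jaccard(items_a, items_b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pair_features(user_a, user_b, surname_frequency):
    """Build associated features for one user data pair."""
    same_surname = user_a["surname"] == user_b["surname"]
    # A rarer shared surname is assumed to be stronger evidence of kinship.
    rarity = 1.0 - surname_frequency.get(user_a["surname"], 0.0) if same_surname else 0.0
    return {
        # Household registration dimension (identity information).
        "same_registered_address": float(
            user_a["registered_address"] == user_b["registered_address"]
        ),
        "same_birth_date": float(user_a["birth_date"] == user_b["birth_date"]),
        # Name dimension.
        "same_surname": float(same_surname),
        "surname_rarity": rarity,
        # Social and interest features.
        "shared_contacts": jaccard(user_a["contacts"], user_b["contacts"]),
        "shared_interests": jaccard(user_a["interests"], user_b["interests"]),
    }


# Hypothetical user data and surname population shares.
user_a = {
    "surname": "Kim", "registered_address": "addr-1", "birth_date": "2000-01-01",
    "contacts": ["c1", "c2", "c3"], "interests": ["music"],
}
user_b = {
    "surname": "Kim", "registered_address": "addr-1", "birth_date": "2000-01-01",
    "contacts": ["c2", "c3", "c4"], "interests": ["music", "games"],
}
features = pair_features(user_a, user_b, {"Kim": 0.2})
print(features)
```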

Embodiments of the present application provide a model training device in which a plurality of user data pairs are acquired, where the data fields of the two sets of user data in each user data pair have an identical part; a user similarity corresponding to each user data pair is acquired; sample data for training a preset classification model is determined; and the classification model is then trained on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of to-be-detected user data in a to-be-detected user data pair can be determined according to the similarity classification model. In this way, the plurality of user data pairs are obtained through identical data fields alone, and the association between the users corresponding to the two sets of user data in each user data pair is determined from the user similarity in order to obtain the sample data for training the preset classification model. Because the sample data can thus be obtained without manual labeling, the model can be trained quickly, model training efficiency is improved, and resource consumption is reduced.

Embodiment 6

Based on the same concept, an embodiment of the present application further provides a data similarity determination device, as shown in FIG. 9.

The data similarity determination device may be the server, the terminal device, or the like provided in the foregoing embodiments.

The data similarity determination device may vary considerably depending on its configuration or performance, and may include one or more processors 901 and a memory 902. The memory 902 may store one or more applications or data, and may be transient or persistent storage. An application stored in the memory 902 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the data similarity determination device. Further, the processor 901 may be configured to communicate with the memory 902 and to execute, on the data similarity determination device, the series of computer-executable instructions in the memory 902. The data similarity determination device may further include one or more power supplies 903, one or more wired or wireless network interfaces 904, one or more input/output interfaces 905, and one or more keyboards 906.

Specifically, in this embodiment, the data similarity determination device includes a memory and one or more programs. The one or more programs are stored in the memory and may include one or more modules, each of which may include a series of computer-executable instructions for the data similarity determination device. The one or more processors are configured to execute the one or more programs, which include computer-executable instructions for:

performing feature extraction on each set of to-be-detected user data in a to-be-detected user data pair; and

determining, according to the to-be-detected user features and a pre-trained similarity classification model, the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair.

The executable instructions may further cause the processor to determine that the to-be-detected users corresponding to the to-be-detected user data pair are twins if the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair is greater than a predetermined similarity classification threshold.
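The twin decision described above reduces to comparing the model's similarity output with a threshold. In the sketch below, the threshold value of 0.80 and the example scores are hypothetical; the strict "greater than" comparison follows the wording of the text.

```python
def is_twin_pair(similarity, threshold):
    """Return True if the pair's similarity exceeds the classification threshold."""
    return similarity > threshold


# Hypothetical similarity scores produced by a similarity classification model.
scores = {("u1", "u2"): 0.93, ("u3", "u4"): 0.41, ("u5", "u6"): 0.80}

twin_pairs = [pair for pair, score in scores.items() if is_twin_pair(score, threshold=0.80)]
print(twin_pairs)
```

A score exactly equal to the threshold is not treated as a twin pair, because the condition requires the similarity to be greater than the threshold.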

Embodiments of the present application provide a data similarity determination device in which a plurality of user data pairs are acquired, where the data fields of the two sets of user data in each user data pair have an identical part; a user similarity corresponding to each user data pair is acquired; sample data for training a preset classification model is determined; and the classification model is then trained on the sample data to obtain a similarity classification model, so that the similarity between the users corresponding to the two sets of to-be-detected user data in a to-be-detected user data pair can be determined according to the similarity classification model. In this way, the plurality of user data pairs are obtained through identical data fields alone, and the association between the users corresponding to the two sets of user data in each user data pair is determined from the user similarity in order to obtain the sample data for training the preset classification model. Because the sample data can thus be obtained without manual labeling, the model can be trained quickly, model training efficiency is improved, and resource consumption is reduced.

Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that described in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement to a method procedure). With the development of technology, however, improvements to many method procedures can now be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method procedure into a hardware circuit. It is therefore inappropriate to assume that an improvement to a method procedure cannot be implemented with physical hardware modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD by themselves, "integrating" it without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is now mostly done with logic compiler software, which is similar to the software compilers used to develop and write programs; the source code to be compiled must be written in a specific programming language called a hardware description language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. Those skilled in the art should also understand that a hardware circuit implementing a logical method procedure can readily be obtained by lightly programming the method procedure in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, a logic gate, a switch, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the apparatuses included in it for implementing various functions can be regarded as structures inside the hardware component. Alternatively, the apparatuses for implementing various functions can even be regarded both as software modules implementing a method and as structures inside the hardware component.

The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or a physical entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For ease of description, the above apparatus is described as being divided into various units by function. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.

As those skilled in the art will understand, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk memory, CD-ROM, and optical memory) that contain computer-usable program code.

본 출원은 본 출원의 실시예에서 방법, 디바이스(시스템) 및 컴퓨터 프로그램 제품의 흐름도 및/또는 블록도를 참조하여 설명된다. 컴퓨터 프로그램 명령어는 흐름도 및/또는 블록도 내의 각 프로세스 및/또는 블록 및 흐름도 및/또는 블록도 내의 프로세스 및/또는 블록의 조합을 구현할 수 있다는 것을 이해하여야 한다. 이들 컴퓨터 프로그램 명령어는 범용 컴퓨터, 특수 목적 컴퓨터, 임베디드 프로세서 또는 다른 프로그래머블 데이터 프로세싱 디바이스의 프로세서에 제공되어 머신을 생성함으로써, 흐름도 내의 하나 이상의 프로세스 및/또는 블록도 내의 하나 이상의 블록에서 명시된 기능을 구현하도록 구성된 장치는 컴퓨터 또는 다른 프로그래머블 데이터 프로세싱 디바이스의 프로세서에 의해 실행되는 명령어를 사용하여 생성될 수 있다.This application is described with reference to flowcharts and / or block diagrams of methods, devices (systems) and computer program products in embodiments of the present application. It should be understood that computer program instructions may implement each process and / or block in the flowcharts and / or block diagrams and a combination of processes and / or blocks in the flowcharts and / or block diagrams. These computer program instructions may be provided to a general purpose computer, special purpose computer, embedded processor, or processor of another programmable data processing device to generate a machine to implement the specified functions in one or more processes in the flowcharts and / or one or more blocks in the block diagram. The configured apparatus may be generated using instructions executed by a processor of a computer or other programmable data processing device.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a specified manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, where the instruction apparatus implements the functions specified in one or more processes in the flowcharts and/or one or more blocks in the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable data processing device to produce computer-implemented processing, and the instructions executed on the computer or other programmable data processing device provide steps for implementing the functions specified in one or more processes in the flowcharts and/or one or more blocks in the block diagrams.

In a typical configuration, a computing device includes one or more central processing units (CPUs), an input/output interface, a network interface, and memory.

The memory may include computer-readable media in the following forms: volatile memory, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.

Computer-readable media include volatile and non-volatile, removable and non-removable media, and may store information using any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, cassette tape, tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store computer-accessible information. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should further be noted that the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, product, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, product, or device that includes the element.

The present application may be described in the general context of computer-executable instructions executed by a computer, for example, program modules. In general, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on what differs from the other embodiments. In particular, the system embodiment is essentially similar to the method embodiment and is therefore described briefly; for relevant details, refer to the description of the method embodiment.

The above description is merely an embodiment of the present application and is not intended to limit the present application. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application falls within the scope of the claims of the present application.

Claims

A model training method, comprising:
obtaining a plurality of user data pairs, wherein the data fields of the two sets of user data in each user data pair have an identical part;
obtaining a user similarity corresponding to each user data pair, wherein the user similarity is the similarity between the users corresponding to the two sets of user data in each user data pair;
determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and
training the classification model based on the sample data to obtain a similarity classification model.

The method of claim 1,
wherein obtaining the user similarity corresponding to each user data pair comprises:
acquiring a biometric feature of the users corresponding to a first user data pair, wherein the first user data pair is any user data pair of the plurality of user data pairs; and
determining the user similarity corresponding to the first user data pair according to the biometric feature of the users corresponding to the first user data pair.

The method of claim 2,
wherein the biometric feature comprises a face image feature;
acquiring the biometric feature of the users corresponding to the first user data pair comprises:
acquiring face images of the users corresponding to the first user data pair; and
performing feature extraction on the face images to obtain the face image feature;
and correspondingly, determining the user similarity corresponding to the first user data pair according to the biometric feature of the users corresponding to the first user data pair comprises:
determining the user similarity corresponding to the first user data pair according to the face image feature of the users corresponding to the first user data pair.
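
The claim above does not state how a similarity value is derived from the face image features. Purely as an illustrative sketch, assuming each user's face image feature is a fixed-length numeric vector and that cosine similarity is the chosen measure (both are assumptions, as is the function name `cosine_similarity`), the comparison might look like this:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (hypothetical choice of measure)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector carries no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)
```

Any other measure that maps two feature vectors to a score could be substituted; the claim only requires that the similarity be determined according to the face image feature.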

The method of claim 2,
wherein the biometric feature comprises a voice feature;
acquiring the biometric feature of the users corresponding to the first user data pair comprises:
acquiring voice data of the users corresponding to the first user data pair; and
performing feature extraction on the voice data to obtain the voice feature;
and correspondingly, determining the user similarity corresponding to the first user data pair according to the biometric feature of the users corresponding to the first user data pair comprises:
determining the user similarity corresponding to the first user data pair according to the voice feature of the users corresponding to the first user data pair.

The method of claim 1,
wherein determining the sample data for training the classification model according to the user similarity corresponding to each user data pair and the plurality of user data pairs comprises:
performing feature extraction on each user data pair of the plurality of user data pairs to obtain an associated user feature between the two sets of user data in each user data pair; and
determining the sample data for training the classification model according to the associated user feature between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair.

The method of claim 5,
wherein determining the sample data for training the classification model according to the associated user feature between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair comprises:
selecting positive sample features and negative sample features from the user features corresponding to the plurality of user data pairs according to the user similarity corresponding to each user data pair and a predetermined similarity threshold; and
using the positive sample features and the negative sample features as the sample data for training the classification model.
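
The selection step in the claim above can be sketched as follows. This is a minimal illustration only: the function name `select_samples`, the representation of a user feature as a list of numbers, and the rule that a similarity strictly above the threshold yields a positive sample are assumptions, not details taken from the claim.

```python
from typing import List, Sequence, Tuple

Feature = Sequence[float]


def select_samples(
    features: Sequence[Feature],
    similarities: Sequence[float],
    threshold: float,
) -> Tuple[List[Feature], List[Feature]]:
    """Split per-pair user features into positive and negative samples.

    Assumption: a pair whose user similarity exceeds the threshold is
    treated as a positive sample; all other pairs are negative samples.
    """
    if len(features) != len(similarities):
        raise ValueError("one similarity value is required per user data pair")
    positives: List[Feature] = []
    negatives: List[Feature] = []
    for feature, similarity in zip(features, similarities):
        if similarity > threshold:
            positives.append(feature)
        else:
            negatives.append(feature)
    return positives, negatives
```

The two returned lists together form the sample data passed to the classification model's training step.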

The method of claim 6,
wherein the user features include a household registration dimension feature, a name dimension feature, a social feature, and an interest feature; the household registration dimension feature includes a feature of user identity information; the name dimension feature includes a feature of user name information and a feature of the degree of scarcity of the user's surname; and the social feature includes a feature of the user's social relationship information.

The method of claim 6,
wherein the number of features included in the positive sample features is equal to the number of features included in the negative sample features.
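
The claim above requires equal numbers of positive and negative sample features but does not say how equality is reached. One possible approach, shown here only as a sketch, is to randomly downsample the larger set; the function name `balance_samples` and the fixed random seed are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def balance_samples(
    positives: Sequence[T],
    negatives: Sequence[T],
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Downsample the larger set so both sets have the same size.

    Assumption: random downsampling is acceptable; the claim itself only
    requires that the two counts be equal.
    """
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```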

The method of any one of claims 1 to 8,
wherein the similarity classification model is a binary classifier model.

A data similarity determination method, comprising:
acquiring a to-be-detected user data pair;
performing feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and
determining the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair according to the to-be-detected user features and a pre-trained similarity classification model.

The method of claim 10,
further comprising:
if the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair is greater than a predetermined similarity classification threshold, determining that the to-be-detected users corresponding to the to-be-detected user data pair are twins.
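
The threshold decision in the claim above can be sketched as follows. The pre-trained similarity classification model is represented here by any callable that maps a feature vector to a similarity score; that interface, the function name `is_twin_pair`, and the default threshold of 0.5 are assumptions made for illustration.

```python
from typing import Callable, Sequence


def is_twin_pair(
    detected_feature: Sequence[float],
    model: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> bool:
    """Return True when the model's similarity score exceeds the threshold.

    `model` stands in for the pre-trained similarity classification model;
    its call signature is hypothetical.
    """
    similarity = model(detected_feature)
    return similarity > threshold
```

For example, with a toy stand-in model that averages the feature values, `is_twin_pair([0.9, 0.8], lambda f: sum(f) / len(f), threshold=0.8)` evaluates the score against the 0.8 threshold.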

A model training device, comprising:
a data acquisition module configured to acquire a plurality of user data pairs, wherein the data fields of the two sets of user data in each user data pair have an identical part;
a similarity obtaining module configured to obtain a user similarity corresponding to each user data pair, wherein the user similarity is the similarity between the users corresponding to the two sets of user data in each user data pair;
a sample data determination module configured to determine, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and
a model training module configured to train the classification model based on the sample data to obtain a similarity classification model.

The device of claim 12,
wherein the similarity obtaining module comprises:
a biometric feature acquisition unit configured to acquire a biometric feature of the users corresponding to a first user data pair, wherein the first user data pair is any user data pair of the plurality of user data pairs; and
a similarity obtaining unit configured to determine the user similarity corresponding to the first user data pair according to the biometric feature of the users corresponding to the first user data pair.

The device of claim 13,
wherein the biometric feature comprises a face image feature;
the biometric feature acquisition unit is configured to acquire face images of the users corresponding to the first user data pair, and to perform feature extraction on the face images to obtain the face image feature;
and correspondingly, the similarity obtaining unit is configured to determine the user similarity corresponding to the first user data pair according to the face image feature of the users corresponding to the first user data pair.

The device of claim 13,
wherein the biometric feature comprises a voice feature;
the biometric feature acquisition unit is configured to acquire voice data of the users corresponding to the first user data pair, and to perform feature extraction on the voice data to obtain the voice feature;
and correspondingly, the similarity obtaining unit is configured to determine the user similarity corresponding to the first user data pair according to the voice feature of the users corresponding to the first user data pair.

The device of claim 12,
wherein the sample data determination module comprises:
a feature extraction unit configured to perform feature extraction on each user data pair of the plurality of user data pairs to obtain an associated user feature between the two sets of user data in each user data pair; and
a sample data determination unit configured to determine the sample data for training the classification model according to the associated user feature between the two sets of user data in each user data pair and the user similarity corresponding to each user data pair.

The device of claim 16,
wherein the sample data determination unit is configured to select positive sample features and negative sample features from the user features corresponding to the plurality of user data pairs according to the user similarity corresponding to each user data pair and a predetermined similarity threshold, and to use the positive sample features and the negative sample features as the sample data for training the classification model.

The device of claim 17,
wherein the user features include a household registration dimension feature, a name dimension feature, a social feature, and an interest feature; the household registration dimension feature includes a feature of user identity information; the name dimension feature includes a feature of user name information and a feature of the degree of scarcity of the user's surname; and the social feature includes a feature of the user's social relationship information.

The device of claim 17,
wherein the number of features included in the positive sample features is equal to the number of features included in the negative sample features.

The device of any one of claims 12 to 19,
wherein the similarity classification model is a binary classifier model.

A data similarity determination device, comprising:
a to-be-detected data acquisition module configured to acquire a to-be-detected user data pair;
a feature extraction module configured to perform feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and
a similarity determination module configured to determine the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair according to the to-be-detected user features and a pre-trained similarity classification model.

The device of claim 21,
further comprising:
a similarity classification module configured to determine that the to-be-detected users corresponding to the to-be-detected user data pair are twins if the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair is greater than a predetermined similarity classification threshold.

A model training device, comprising:
a processor; and
a memory configured to store computer-executable instructions,
wherein the computer-executable instructions, when executed, cause the processor to perform the following operations:
obtaining a plurality of user data pairs, wherein the data fields of the two sets of user data in each user data pair have an identical part;
obtaining a user similarity corresponding to each user data pair, wherein the user similarity is the similarity between the users corresponding to the two sets of user data in each user data pair;
determining, according to the user similarity corresponding to each user data pair and the plurality of user data pairs, sample data for training a preset classification model; and
training the classification model based on the sample data to obtain a similarity classification model.

A data similarity determination device, comprising:
a processor; and
a memory configured to store computer-executable instructions,
wherein the computer-executable instructions, when executed, cause the processor to perform the following operations:
acquiring a to-be-detected user data pair;
performing feature extraction on each set of to-be-detected user data in the to-be-detected user data pair to obtain to-be-detected user features; and
determining the similarity between the users corresponding to the two sets of to-be-detected user data in the to-be-detected user data pair according to the to-be-detected user features and a pre-trained similarity classification model.