Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets

Wahab, Ahmed Anu; Hou, Daqing; Cheng, Nadia; Huntley, Parker; Devlen, Charles

Computer Science > Machine Learning

arXiv:2501.07600 (cs)

[Submitted on 10 Jan 2025]

Title: Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets

Authors: Ahmed Anu Wahab, Daqing Hou, Nadia Cheng, Parker Huntley, Charles Devlen

Abstract: Deep learning models, such as Siamese Neural Networks (SNN), have shown great potential in capturing the intricate patterns in behavioral data. However, the impact of dataset breadth (i.e., the number of subjects) and depth (e.g., the amount of training samples per subject) on the performance of these models is often informally assumed and remains under-explored. To this end, we have conducted extensive experiments using the concepts of "feature space" and "density" to guide and gain a deeper understanding of the impact of dataset breadth and depth on three publicly available keystroke datasets (Aalto, CMU and Clarkson II). By varying the number of training subjects, the number of samples per subject, the amount of data in each sample, and the number of triplets used in training, we found that, when feasible, increasing dataset breadth yields a well-trained model that effectively captures more inter-subject variability. In contrast, we found that the extent of the impact of depth depends on the nature of the dataset. Free-text datasets are influenced by all three depth-wise factors: inadequate samples per subject, sequence length, training triplets, and gallery sample size may all lead to an under-trained model. Fixed-text datasets are less affected by these factors, which makes it easier to create a well-trained model. These findings shed light on the importance of dataset breadth and depth in training deep learning models for behavioral biometrics and provide valuable insights for designing more effective authentication systems.
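The four quantities varied in the abstract (training subjects, samples per subject, data per sample, and training triplets) can be illustrated with a minimal sketch. This is not the authors' code: the `Dataset` layout, the function names `subsample` and `sample_triplets`, and the choice of uniform random sampling are all assumptions made here for illustration.

```python
import random
from typing import Dict, List, Tuple

# Assumed layout (hypothetical): subject id -> list of samples,
# where each sample is a sequence of keystroke feature values.
Sample = List[float]
Dataset = Dict[str, List[Sample]]


def subsample(data: Dataset, n_subjects: int, samples_per_subject: int,
              seq_len: int, seed: int = 0) -> Dataset:
    """Cut a dataset down along breadth and two depth axes."""
    rng = random.Random(seed)
    # Breadth: keep only n_subjects randomly chosen subjects.
    subjects = rng.sample(sorted(data), n_subjects)
    out: Dataset = {}
    for subject in subjects:
        # Depth (samples per subject): keep a random subset of samples.
        chosen = rng.sample(data[subject], samples_per_subject)
        # Depth (data per sample): truncate each sample to seq_len values.
        out[subject] = [sample[:seq_len] for sample in chosen]
    return out


def sample_triplets(data: Dataset, n_triplets: int,
                    seed: int = 0) -> List[Tuple[Sample, Sample, Sample]]:
    """Draw (anchor, positive, negative) triplets uniformly at random.

    Anchor and positive come from the same subject, the negative from a
    different one. Needs at least 2 subjects and 2 samples per subject.
    """
    rng = random.Random(seed)
    subjects = sorted(data)
    triplets = []
    for _ in range(n_triplets):
        same, other = rng.sample(subjects, 2)
        anchor, positive = rng.sample(data[same], 2)
        negative = rng.choice(data[other])
        triplets.append((anchor, positive, negative))
    return triplets
```

Calling `subsample(data, n_subjects=3, samples_per_subject=2, seq_len=6)` followed by `sample_triplets(subset, n_triplets=7)` would produce one experimental configuration under these assumptions; sweeping one argument while holding the others fixed isolates a single breadth or depth factor.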

Comments:	19 pages, 4 figures
Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Cite as:	arXiv:2501.07600 [cs.LG]
	(or arXiv:2501.07600v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.07600

Submission history

From: Charles Devlen
[v1] Fri, 10 Jan 2025 17:06:46 UTC (1,033 KB)
