Reading Between the Lines: A dataset and a study on why some texts are tougher than others

Khallaf, Nouran; Eugeni, Carlo; Sharoff, Serge

arXiv:2501.01796 (cs)

[Submitted on 3 Jan 2025]

Abstract: Our research aims to better understand what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically, people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties which is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from the parallel texts (standard English and Easy to Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform the task of multiclass classification to predict the strategies required for simplification. We also investigate the possibility of interpreting the decisions of this language model when it is aimed at predicting the difficulty of sentences. The resources are available from https://github.com/Nouran-Khallaf/why-tough

Comments:	Published at Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WR-AI-CogS), COLING 2025, Abu Dhabi
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.01796 [cs.CL]
	(or arXiv:2501.01796v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.01796

Submission history

From: Serge Sharoff
[v1] Fri, 3 Jan 2025 13:09:46 UTC (2,091 KB)
