MIT News August 30, 2024
To improve data transparency and the understanding of how language models are trained on vast, diverse and inconsistently documented datasets, an international team of researchers (USA: MIT, Harvard, UC Irvine, industry, University of Colorado, Olin College of Engineering, Carnegie Mellon University; also France and Canada) convened a multidisciplinary effort between legal and machine learning experts to systematically audit and trace more than 1,800 text datasets. They developed tools and standards to trace the lineage of these datasets, including their source, creators, licenses and subsequent use. They found sharp divides in the composition and focus of data licensed for commercial use. Important categories, including low-resource languages, creative tasks and new synthetic data, tended to be restrictively licensed. They also observed frequent miscategorization of licenses on popular dataset hosting sites, with license omission rates of more than 70% and error rates of more than 50%. According to the researchers, their work highlights a crisis in misattribution and informed use of the popular datasets driving many recent breakthroughs, as well as the application of copyright law and fair use to finetuning data. They released their audit, the Data Provenance Explorer, at www.dataprovenance.org. Open Access TECHNICAL ARTICLE