MIT News, August 30, 2024. To improve data transparency and the understanding of how language models are trained on vast, diverse, and inconsistently documented datasets, an international team of researchers (from MIT, Harvard, UC Irvine, the University of Colorado, Olin College of Engineering, Carnegie Mellon University, and industry in the USA, as well as from France and Canada) convened a multidisciplinary effort between legal and machine learning experts to systematically audit and trace more than 1,800 text datasets. They developed tools and standards to trace the lineage of these datasets, including their sources, creators, licenses, and subsequent use. They found sharp divides in the composition and focus of data licensed for […]