@elduvelle @lf_araujo @JoeSondow @dalias the sad part, if I'm recalling correctly, is that all the code hosted there (in public repos¹) is already being used as training data. ¹ They mentioned "publicly accessible repositories" two years ago, which means they weren't filtering by license. Also, I found at the time that they suggested data that wasn't available in any public repository. My guess is they were using some private repositories as well.