Hey everyone, looks like having actual "standards" and sticking to them works! Or we are just total jerks who will not blindly accept whatever a random person types into a textbox. One or the other. Who can say, really?
The Importance of High-Signal Data: Modern AI research has definitively proven that data quality is vastly more important than data quantity. Microsoft's landmark 2023 paper "Textbooks Are All You Need" (which introduced the phi-1 coding model) demonstrated that aggressively filtering out low-quality "noise" from training data leads to dramatically better coding models.
The Subsidy of Human Labor: Stack Overflow's notorious moderation policies—closing duplicates, downvoting broken code, and demanding minimal reproducible examples—did exactly this human-labor-intensive data filtering for over 15 years. Without this rigorous gatekeeping, LLMs would have ingested vast amounts of broken, insecure, or poorly formatted code, which would have severely degraded their baseline performance. The assertion that AI companies are "subsidized" by the unpaid labor of diligent forum moderators is a widely accepted critique in the fields of AI ethics and data provenance.
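To make the "filtering out low-quality noise" idea concrete, here is a toy sketch of a heuristic code-quality filter in Python. This is purely illustrative: the function name and the specific heuristics (does it parse, does it carry a comment or docstring) are my own assumptions, not the actual pipeline from the phi-1 paper, which uses much richer, classifier-based quality signals.

```python
import ast

def looks_like_quality_python(snippet: str) -> bool:
    """Toy heuristic filter (hypothetical): keep only snippets that
    parse as valid Python and carry some explanatory signal, i.e. a
    comment or a module docstring. Not the phi-1 method."""
    try:
        tree = ast.parse(snippet)
    except SyntaxError:
        # Broken code: exactly the noise moderators reject on forums.
        return False
    has_comment = "#" in snippet
    has_docstring = bool(tree.body) and ast.get_docstring(tree) is not None
    return has_comment or has_docstring

corpus = [
    "def add(a, b):\n    # sum two numbers\n    return a + b",  # keep
    "def broken(:\n    return",  # syntax error: dropped
    "x=1",                       # parses, but zero signal: dropped
]
filtered = [s for s in corpus if looks_like_quality_python(s)]
```

The point of the sketch is only that even crude, cheap heuristics discard a lot of a raw corpus; human moderation did a far more discriminating version of this job by hand.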
-
@codinghorror and that's why I said #AIslop should be #banned as #WastefulComputing and any research and data #OpenSourced!
-