To keep #OpenStreetMap.org up and running while we're being deluged by scrapers, we've blocked 320,000+ primarily residential IPv4 addresses in the last 24 hours (+ 100,000 IPv6) involved in scraping.
-
@osm_tech and we can tell the scrapers are AI built because a cursory glance at the documentation on the "coders" part would've prevented this problem.
@ClariNerd @osm_tech Because their IP ranges are increasingly being blocked by servers following their harmful scraping habits, AI companies are now releasing "browsers" so they can scrape from residential IPs instead and circumvent blocks. Oh, sorry, I meant "so they can empower users with AI insight in this new era of information".
-
@JonSaenzAgirre @osm_tech
The scrapers are DUMB.
They are not curated, have only basic maintenance, and are built to gobble up ANYTHING textual they encounter, without respect, mercy or reason. They just collect meaningless data.
That’s the nature of the coveted LLMs: just statistics, no understanding, structure or meaning.
And greedy crooks in haste to make quick money just grab everything they can.
The AI bubble needs to pop really soon.
@vampirdaddy @osm_tech this seems a reasonable explanation. Quantity of bytes irrespective of sense. Thank you
-
@utf_7 It is madness, start here: https://www.openstreetmap.org/node/1 and keep going once you reach https://www.openstreetmap.org/node/10000000000, then start on ways, and relations
or just download the latest weekly export from planet.openstreetmap.org 
-
If you need OSM data, please don't scrape the website - use the official downloads at https://planet.openstreetmap.org

#AI #Bots #Abuse
-
@michel42 We'd like to share the IP address list, but unfortunately don't think we can due to legal concerns.
-
@felixcremer @utf_7 because you are looking at version 43 of the node which has been subject to redaction (licence change), vandalism, and simply buggy software over 20+ years https://www.openstreetmap.org/node/1/history#map=18/1.999999/2.000000
-
@felixcremer @utf_7 I didn't mention this, but should have: prior to OSM API 0.5 (October 2007) objects were not versioned, the original "node 1" was deleted prior to that date and therefore doesn't actually exist in the current OSM data at all. The current "node 1" is a reuse of the old id IIRC.
-
@zymurgic The website interface designed for humans is the main issue I believe. See also https://en.osm.town/@osm_tech/115974391032358572
So that's... stupid.
I'm not sure who hosts the main Overpass API instance, but I don't think it is the OpenStreetMap Foundation, so (while they probably do have similar challenges) it's not that we're talking about.
-
@simon @felixcremer TIL something about OSM nodes. How far apart are two neighboring nodes? Or does this vary with the resolution of the area, e.g. are nodes on the high seas more miles apart than in Detroit?
@utf_7 @felixcremer the easiest way to calculate this is to use the Haversine distance, see https://en.wikipedia.org/wiki/Haversine_formula
Outside of that nodes are placed where they are deemed necessary to replicate the geometry of the objects. Naturally a rendering on a map can smooth that out if the designer wants to (most don't though).
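For anyone who wants to try the Haversine suggestion above, here is a minimal sketch. It assumes a spherical Earth with a mean radius of about 6,371 km, so it is an approximation rather than a survey-grade result, and the function name `haversine_m` is just an illustrative choice.

```python
import math

# Assumption: spherical Earth with the mean radius in metres.
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp against floating-point overshoot before taking the arcsine.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

Calling `haversine_m(lat_a, lon_a, lat_b, lon_b)` with the coordinates of two consecutive nodes of a way gives their spacing; as noted above, that spacing depends on how much detail the mapped geometry needs, not on a fixed grid.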
-
@ryanprior @olbohlen @HunterZ easiest is to just block http 1.1 requests for sites being hammered, since 99% of scrape requests I've seen have been with that protocol.
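Before blocking a whole protocol version, it may be worth checking what share of your own traffic would be affected. The sketch below tallies protocol versions from access log lines. It assumes the common/combined log format, where the request line appears quoted as `"GET /path HTTP/1.1"`; the function name `protocol_share` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: logs use the common/combined format with a quoted request line.
REQUEST_LINE = re.compile(r'"[A-Z]+ \S+ (HTTP/\d(?:\.\d)?)"')


def protocol_share(log_lines):
    """Return {protocol: fraction of matched requests} for the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_LINE.search(line)
        if match:  # lines without a recognisable request line are skipped
            counts[match.group(1)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {proto: n / total for proto, n in counts.items()}


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [01/Jan/2026:10:00:00 +0000] "GET /node/1 HTTP/1.1" 200 512',
        '203.0.113.6 - - [01/Jan/2026:10:00:01 +0000] "GET / HTTP/2.0" 200 1024',
    ]
    print(protocol_share(sample))
```

If the HTTP/1.1 share among legitimate visitors turns out to be non-trivial, a blanket block would lock those users out, so the result is best read alongside other signals.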
-
@arichtman WTF are Mull doing. Chrome, but no sec-ch-ua.
I'm not having much luck in finding their Android browser... I'm seeing Mullvad VPN, and the browser in alpha for win/mac/linux, but not android. Can you point me in the right direction?
Not going to dive into it now, but I'd like to save it for my records.
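The "Chrome, but no sec-ch-ua" observation can be turned into a rough heuristic. This is only a sketch: it assumes request headers are available as a plain dict, the function name `claims_chrome_without_client_hints` is made up for illustration, and the check can misfire, since Chromium sends `Sec-CH-UA` only over HTTPS and older versions or some Chromium derivatives may not send it at all.

```python
def claims_chrome_without_client_hints(headers):
    """Return True if the User-Agent claims Chrome but no Sec-CH-UA header is present.

    `headers` is assumed to be a dict of request header names to values.
    Header names are matched case-insensitively.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    claims_chrome = "Chrome/" in user_agent
    has_client_hint = "sec-ch-ua" in normalized
    return claims_chrome and not has_client_hint
```

A match is better treated as one signal to log or score than as grounds for an outright block.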
-
@HunterZ@mastodon.sdf.org @osm_tech@en.osm.town lots of mobile/desktop apps, browser extensions, and even IoT devices are paid by "residential proxy" companies to prey on their users by selling said users' connections to AI scrapers https://www.spamhaus.org/resource-hub/compromised/lets-talk-about-the-danger-of-residential-proxy-networks/
Until recently I mainly fought against residential proxies facilitating DDoS and crawling attacks.
Thus I did not have the threat of access to internal systems on my radar.
I think that vector is under-reported:
Residential (i.e. software- or library-embedded) proxies on smartphones that are allowed into company networks.
-