The one-liner:

dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz

This is brilliant.
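
For scale: 10 GB of zeros compresses down to a gzip file of roughly 10 MB, so it costs almost nothing to store or serve, while anything that naively decompresses the response has to inflate the full 10 GB. A quick sanity check of both sizes (zcat is slow here, but gzip -l under-reports anything over 4 GB because the size field in the gzip footer is only 32 bits):

ls -lh 10GB.gz        # compressed size on disk, roughly 10 MB
zcat 10GB.gz | wc -c  # decompressed size: 10737418240 bytes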

  • 👍Maximum Derek👍@discuss.tchncs.de · 9 hours ago

    When I was serving high-volume sites (that were targeted by scrapers) I had a collection of files in our CDN that contained nothing but the word “no” over and over. Scrapers who barely hit our detection thresholds saw all their requests go to the 50M version. Super aggressive scrapers got the 10G version. And the scripts that just wouldn’t stop got the 50G version.

    It didn’t move the needle on budget, but hopefully it cost them.
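
    A minimal sketch of how files like that could be generated (not necessarily the exact method described above; yes repeats its argument forever, and head -c caps the decompressed payload):

    yes no | head -c $((50*1024*1024)) | gzip -9 > no-50M.gz       # ~50 MB of “no” when inflated
    yes no | head -c $((10*1024*1024*1024)) | gzip -9 > no-10G.gz  # ~10 GB
    yes no | head -c $((50*1024*1024*1024)) | gzip -9 > no-50G.gz  # ~50 GB

    Served with a Content-Encoding: gzip header, the files stay tiny at rest and on the wire but blow up in the client’s decompressor.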

      • 👍Maximum Derek👍@discuss.tchncs.de · 7 hours ago

        Most often because they don’t download any of the CSS or external JS files from the pages they scrape. But there are a lot of other patterns you can detect once you have their traffic logs loaded in a time-series database. I used an ELK stack back in the day.
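
        That no-assets pattern can be approximated straight from the access logs, even without a time-series database. A rough sketch, assuming the common nginx/Apache combined log format (field 1 is the client IP, field 7 the request path; the 100-page threshold is an arbitrary placeholder):

        awk '{ if ($7 ~ /\.(css|js)/) asset[$1]++; else pages[$1]++ }
             END { for (ip in pages)
                     if (!(ip in asset) && pages[ip] > 100)
                       print pages[ip], ip }' access.log | sort -rn | head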

        • sugar_in_your_tea@sh.itjust.works · 7 hours ago

          That sounds like a lot of effort. Are there any tools that get like 80% of the way there? Like something I could plug into Caddy, nginx, or haproxy?

          • 👍Maximum Derek👍@discuss.tchncs.de · 6 hours ago

            My experience is with systems that handle nearly 1000 pageviews per second. We did use a spread of haproxy servers to handle routing and SNI, but they were being fed offender lists by external analysis tools (built in-house).
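
            Hypothetical glue for that setup, assuming haproxy’s runtime socket is enabled and the config already routes matches to a bomb backend (something like: use_backend bomb if { src,map_ip(/etc/haproxy/offenders.map) -m found }):

            # push offender IPs produced by the analysis job into the live map
            while read -r ip; do
              echo "add map /etc/haproxy/offenders.map $ip bomb" |
                socat stdio /var/run/haproxy.sock
            done < offenders.txt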

            • sugar_in_your_tea@sh.itjust.works · 6 hours ago

              Dang, I was hoping for a FOSS project that would do most of the heavy lifting for me. Maybe such a thing exists, idk, but it would be pretty cool to have a pluggable system that analyzes activity and tags connections w/ some kind of identifier, so I could configure a web server to either send them nonsense (to poison AI scrapers), zip bombs (for bots that aren’t respectful of resources), or redirect them to a honeypot (for malicious actors).

              A quick search didn’t yield anything immediately, but I wasn’t that thorough. I’d be interested if anyone knows of such a project that’s pretty easy to play with.

              • A Basil Plant@lemmy.world · 5 hours ago · edited

                Not exactly what you asked, but do you know about ufw-blocklist?

                I’ve been using this on my multiple VPSes for some time now and the number of fail2ban failed/banned has gone down like crazy. Previously, I had 20k failed attempts after a few months and 30-50 currently-banned IPs at all times; now it’s less than 1k failed after a year and maybe 3-ish banned at any time.
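
                If you want the same before/after numbers on your own boxes, fail2ban reports them directly:

                fail2ban-client status        # lists the active jails
                fail2ban-client status sshd   # currently/total failed and banned for the sshd jail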

                There was also that paid service where users share the IP addresses behind spammy attempts with a centralized network, which does some dynamic intelligence monitoring. I forgot the name and search these days isn’t great. Something to do with “Sense”? It was paid, but well recommended as far as I remember.

                Edit: seems like the keyword is “threat intelligence platform”.