Protection from AI bots and crawlers (SZ-73)
Artur Hefczyc opened 2 months ago

The main problem we face is server overload caused by AI bots and crawlers. The most sensible solution seems to be hiding resource-heavy operations from anonymous or guest access. I suggest making these operations accessible based on user permissions; this would give us the most flexibility.

  • rk@tigase.net changed state to 'In Progress' 7 days ago
    Previous Value: Open → Current Value: In Progress
  • rk@tigase.net commented 5 days ago

    Bot Attack Surface Area

    (attachment: Screenshot 2026-03-10 at 6.03.11 PM.png)

    Description

    1) Test Approach:

    • Choose tools to measure impact of Bots
    • Choose tools to induce Bot like stress
    • Establish a baseline of resource usage under bot attack before applying bot guardrails
    • After each layer is added — verify the layer holds under the same load. We would watch the sztab-backend and sztabina pods specifically during bot stress tests.

    2) Identify tools to measure the impact of bots (CPU or I/O usage)

    • Grafana + Prometheus — we already have this or it's easy to add to the cluster via Helm.
    • Gives us CPU, memory, and network I/O per pod.

    3) Identify tools to induce Bot-like stress

    A) k6 — open source load testing tool

    We can write scripts in TypeScript and simulate concurrent anonymous/bot traffic against specific endpoints.

    Example:

    import http, { RefinedResponse, ResponseType } from 'k6/http';
    import { check } from 'k6';
    
    export default function (): void {
      const res: RefinedResponse<ResponseType> = http.get(
        'https://staging.sztab.com/api/projects/1/pulls/5/diff',
        {
          headers: { 'User-Agent': 'GPTBot/1.0' },
        }
      );
    
      check(res, {
        'status is 200': (r) => r.status === 200,
      });
    }
    

    B) Java with Gatling

    This is essentially the Java equivalent of k6. Since the broader Tigase team is Java-first, Gatling scripts would feel more natural to them and fit into Maven builds. Shall I use this option? This way the bot simulation scripts can be reused for other Tigase projects.

    Gatling scripts can be written in Java or Kotlin, so developers can use whichever they prefer.
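    For comparison, a minimal Gatling simulation of the same bot traffic might look like this in the Java DSL (a sketch, untested against Sztab; the host, endpoint, and 50-VU load shape mirror the k6 setup in this issue):

```java
// Sketch only: a Gatling (Java DSL) rough equivalent of the k6 script above.
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class BotStressSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http
            .baseUrl("https://staging.sztab.com")
            .userAgentHeader("GPTBot/1.0"); // simulate an AI crawler

    ScenarioBuilder bots = scenario("Anonymous bot traffic")
            .exec(http("pull diff")
                    .get("/api/projects/1/pulls/5/diff")
                    .check(status().is(200)));

    {
        // 50 concurrent virtual users held for 60 seconds
        setUp(bots.injectClosed(constantConcurrentUsers(50).during(60)))
                .protocols(httpProtocol);
    }
}
```

    A Simulation class like this runs from Maven via the Gatling plugin, which is what makes it attractive for reuse across Java-first Tigase projects.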

    C) JMeter (JMX test plans)

    JMeter test plans (.jmx files) can serve a dual purpose:

    • Bot simulation
    • Stress test

    However, k6 is frictionless and will work "out of the box".


    4) Layered approach to Bot mitigation

    a) Layer 1: Spring Security — anonymous request blocking
    (lowest effort, highest impact)

    b) Layer 2: Caddy — rate limiting + bot filtering at the edge
    (before Spring even sees the request)

    c) Layer 3: robots.txt (soft signal, respected by well-behaved bots)

    d) Layer 4: Permission-based access (Artur's suggestion — most flexible)


    4.1 Layer 1

    The simplest way is to identify the most expensive APIs and mandate authentication for shortlisted APIs.

    With Spring this is easy: in the Spring Security policy add .authenticated() for such endpoints.

    APIs that trigger git clone and git merge are candidates.
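    A sketch of that Spring Security change (the request patterns below are illustrative assumptions based on the endpoints discussed in this issue, not Sztab's actual route table):

```java
// Sketch only: require authentication on the expensive, git-backed endpoints.
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class BotProtectionSecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                // Expensive endpoints: anonymous requests are rejected
                .requestMatchers(
                        "/api/projects/*/pulls/*/diff",
                        "/api/projects/*/files/**")
                .authenticated()
                // Everything else keeps its current access rules
                .anyRequest().permitAll());
        return http.build();
    }
}
```

    Anonymous bots hitting the matched endpoints then fail in the security filter chain before any controller or Sztabina work is triggered.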


    4.2 Layer 2

    Since Caddy is already our reverse proxy with forward_auth, we can add:

    # Anonymous = no Authorization header and no session cookie
    @anonymous {
        not header Authorization *
        not header Cookie *
    }

    # Rate-limit anonymous traffic (requires a rate-limit plugin such as
    # mholt/caddy-ratelimit; exact directive syntax depends on the plugin)
    rate_limit @anonymous {
        zone anon {
            key    {remote_host}
            events 10
            window 1m
        }
    }

    # Block known bot user agents
    @bots header_regexp User-Agent `(?i)(GPTBot|ClaudeBot|CCBot|Bytespider|SemrushBot|AhrefsBot)`
    respond @bots 403
    

    This stops bots before they consume Spring Boot or Sztabina resources at all.


    4.3 Layer 3 — robots.txt

    Serve a robots.txt directly from Caddy, blocking AI crawlers:

    User-agent: GPTBot
    Disallow: /
    
    User-agent: ClaudeBot
    Disallow: /
    
    User-agent: CCBot
    Disallow: /
    
    User-agent: *
    Disallow: /api/
    Allow: /
    

    This is a soft signal, respected only by well-behaved crawlers.
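    One way to serve this file straight from Caddy without involving the backend (a sketch; the /etc/caddy/static path is an assumption):

```
# Serve robots.txt from disk at the proxy edge
handle /robots.txt {
    root * /etc/caddy/static
    file_server
}
```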


    4.4 Layer 4 — Permission-based access

    This is the existing ExternalUserPolicy / role system extended with a new dimension.

    Instead of just authenticated vs anonymous, we gate by role.

    Example:

    @PreAuthorize("hasPermission(#projectId, 'Project', 'READ_DIFFS')")
    public DiffResponse getPullRequestDiff(...) { ... }
    

    Roles like GUEST / COMMUNITY could be explicitly excluded from diff/search endpoints even if authenticated.

    This is useful if we ever allow public read-only accounts but still want to protect expensive resources.

    4.5 Layer 5 (Host Layer) — Using Host IDS (such as OSSEC)

    OSSEC / Wazuh (OSSEC's modern fork) can help: it does log analysis and anomaly detection, and can trigger active responses (e.g. auto-banning an IP via iptables). But for now this is likely overkill in Sztab's context.

  • rk@tigase.net commented 4 days ago
    rksuma@Ramakrishnans-MacBook-Pro sztab % git checkout -b feature/SZ-73-Protection-from-AI-bots-and-crawlers 
    Switched to a new branch 'feature/SZ-73-Protection-from-AI-bots-and-crawlers'
    rksuma@Ramakrishnans-MacBook-Pro sztab %
    
  • rk@tigase.net commented 4 days ago

    I have assumed that bots/crawlers cause performance issues only by exhausting resources.

    But bots can also attempt privilege escalation. Hence this issue is in part about security posture as well.

    Data harvesting is another risk: a crawler indexing all the issues, PRs, comments, and code is a confidentiality problem for private projects even if access is read-only, and it can feed competitor intelligence gathering.

    Please let me know if we should treat this as a performance issue alone in this rev.

  • rk@tigase.net commented 4 days ago

    Monitoring tool

    Phase 1 (immediate) — kubectl top for CPU/memory across the three pods during stress tests. Free, zero setup, good enough to establish baseline.

    Phase 2 (proper) — add node_exporter to the EC2 node for disk I/O, feed into Grafana alongside Caddy metrics. Full picture.

  • rk@tigase.net commented 3 days ago

    SZ-73 Bot Protection — Baseline Measurements

    Purpose

    Establish pre-mitigation resource usage baseline on staging, before any bot protection layers are applied. These numbers will be used to validate the effectiveness of each mitigation layer as it is implemented.

    Environment

    • Cluster: k3s on AWS EC2 (us-west-2)
    • Host: ec2-35-87-145-56.us-west-2.compute.amazonaws.com
    • Namespace: sztab-staging
    • Image tag: sz73-bot-protection (rebased on wolnosc, no SZ-73 changes applied yet)
    • Date: 2026-03-12

    Idle Baseline (no load)

    Captured via kubectl top pods -n sztab-staging with no active traffic.

    Pod             CPU (cores)   Memory
    sztab-backend   5m            369Mi
    sztab-db        4m            46Mi
    sztabina        1m            1Mi
    caddy           1m            10Mi
    sztab-ui        1m            2Mi

    Notes:

    • sztab-backend memory at 369Mi reflects normal Spring Boot JVM baseline (expected)
    • sztabina and caddy are effectively idle
    • sztab-db at 4m CPU reflects background PostgreSQL activity only

    Bot Stress Baseline (under simulated load)

    TODO: Run k6 stress test simulating anonymous bot traffic against expensive endpoints. Capture CPU and memory spike for sztab-backend, sztabina, and sztab-db.

    Target Endpoints

    Endpoint                                 Why Expensive
    GET /api/projects/{id}/pulls/{id}/diff   Triggers git diff via Sztabina
    GET /api/projects/{id}/issues?q=...      DSL query, DB-heavy
    GET /api/projects/{id}/files/{branch}    Git tree traversal via Sztabina

    k6 Test Parameters

    • Virtual users: TBD
    • Duration: TBD
    • User-Agent: GPTBot/1.0 (simulates AI crawler)
    • Auth: none (anonymous)

    Results

    TODO: Fill in after k6 run.

    Pod             CPU (cores)   Memory   Delta vs Idle
    sztab-backend   -             -        -
    sztab-db        -             -        -
    sztabina        -             -        -

    Post-Mitigation Measurements

    TODO: Re-run same k6 test after each layer is applied and record results here.

    Layer     Description                             Backend CPU   Sztabina CPU   Notes
    Layer 1   Spring Security .authenticated()        -             -              -
    Layer 2   Caddy rate limiting + bot UA blocking   -             -              -
    Layer 3   robots.txt                              -             -              soft signal only
    Layer 4   Permission-based access (role gating)   -             -              -
  • rk@tigase.net commented 3 days ago

    Next step: install k6 on my laptop:

    rksuma@Ramakrishnans-MacBook-Pro sztab %  brew install k6
    //...
    rksuma@Ramakrishnans-MacBook-Pro sztab %  k6 version
    k6 v1.6.1 (commit/devel, go1.26.0, darwin/arm64)
    rksuma@Ramakrishnans-MacBook-Pro sztab % 
    

    Now, I'll write a TypeScript k6 script targeting the three expensive endpoints with a GPTBot user agent, no auth, and enough virtual users to actually stress the backend.

  • rk@tigase.net referenced from other issue 2 days ago
  • rk@tigase.net commented 2 days ago

    Results of Layer 1 testing after locking down all expensive endpoints with .authenticated(). (Please disregard the spurious error at the end when deleting the test project.)

    Essentially, since the bot does not authenticate itself, every hit returns HTTP 403, so the run makes no measurable difference to Sztab's resource usage.

    
    rksuma@Ramakrishnans-MacBook-Pro sztab % ADMIN_USER=admin ADMIN_PASSWORD=SztabStagingAdmin! ./scripts/stress-test/k6/run-stress-test.sh
    [INFO]  === SZ-73 Bot Stress Test ===
    [INFO]  Base URL:    http://ec2-35-87-145-56.us-west-2.compute.amazonaws.com
    [INFO]  Namespace:   sztab-staging
    [INFO]  VUs:         50
    [INFO]  Duration:    60s
    [INFO]  --- Step 1: Login ---
    [INFO]  Login successful.
    [INFO]  Logged in as user id=1
    [INFO]  --- Step 2: Create Sztab project ---
    [INFO]  Project 'SZ73 Stress Test' already exists — looking up existing project...
    [INFO]  Found existing project: id=16
    [INFO]  --- Step 3: Create issue ---
    [INFO]  Issue created: id=3
    [INFO]  --- Step 4: Create pull request ---
    [INFO]  Pull request created: id=3
    [INFO]  --- Step 5: Baseline pod metrics (idle) ---
    NAME                            CPU(cores)   MEMORY(bytes)   
    caddy-847774bbf9-xzvnv          1m           12Mi            
    sztab-backend-644c77d58-r46xd   2m           432Mi           
    sztab-db-fb967c9d5-fs84w        2m           44Mi            
    sztab-ui-57764ffc4f-r9hlg       1m           3Mi             
    sztabina-65b5cff756-kzl4f       1m           3Mi             
    [INFO]  --- Step 6: Run k6 stress test ---
    [INFO]  Watch pod metrics in another terminal: kubectl top pods -n sztab-staging --watch
    
             /\      Grafana   /‾‾/  
        /\  /  \     |\  __   /  /   
       /  \/    \    | |/ /  /   ‾‾\ 
      /          \   |   (  |  (‾)  |
     / __________ \  |_|\_\  \_____/ 
    
    
         execution: local
            script: /Users/rksuma/tigase/sztab/scripts/stress-test/k6/bot-stress-test.ts
            output: -
    
         scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
                  * default: 50 looping VUs for 1m0s (gracefulStop: 30s)
    
    
    
      █ THRESHOLDS 
    
        http_req_duration
        ✓ 'p(95)<5000' p(95)=134.53ms
    
    
      █ TOTAL RESULTS 
    
        checks_total.......: 69856  1161.339279/s
        checks_succeeded...: 25.00% 17464 out of 69856
        checks_failed......: 75.00% 52392 out of 69856
    
        ✗ status is 200 (unprotected)
          ↳  0% — ✓ 0 / ✗ 17464
        ✗ status is 401 (auth required)
          ↳  0% — ✓ 0 / ✗ 17464
        ✓ status is 403 (bot blocked)
        ✗ status is 429 (rate limited)
          ↳  0% — ✓ 0 / ✗ 17464
    
        HTTP
        http_req_duration....: avg=71.19ms  min=29.65ms  med=55.02ms  max=422.05ms p(90)=124.52ms p(95)=134.53ms
        http_req_failed......: 100.00% 17464 out of 17464
        http_reqs............: 17464   290.33482/s
    
        EXECUTION
        iteration_duration...: avg=172.15ms min=130.26ms med=155.68ms max=522.51ms p(90)=225.17ms p(95)=235.3ms 
        iterations...........: 17464   290.33482/s
        vus..................: 50      min=50             max=50
        vus_max..............: 50      min=50             max=50
    
        NETWORK
        data_received........: 7.8 MB  129 kB/s
        data_sent............: 2.3 MB  38 kB/s
    
    
    
    
    running (1m00.2s), 00/50 VUs, 17464 complete and 0 interrupted iterations
    default ✓ [======================================] 50 VUs  1m0s
    [INFO]  --- Step 7: Pod metrics (post-stress) ---
    NAME                            CPU(cores)   MEMORY(bytes)   
    caddy-847774bbf9-xzvnv          99m          20Mi            
    sztab-backend-644c77d58-r46xd   252m         440Mi           
    sztab-db-fb967c9d5-fs84w        2m           45Mi            
    sztab-ui-57764ffc4f-r9hlg       1m           3Mi             
    sztabina-65b5cff756-kzl4f       1m           4Mi             
    [INFO]  === Stress test complete. Teardown will run now. ===
    [INFO]  --- Teardown ---
    [INFO]  Deleting Sztab project 16...
    [ERROR] Failed to delete project 16
    [INFO]  Teardown complete.
    rksuma@Ramakrishnans-MacBook-Pro sztab %
    
  • rk@tigase.net commented 1 day ago

    Baseline stress test results (pre-protection, 2026-03-14)

    Ran k6 stress test against staging (ec2-35-87-145-56.us-west-2.compute.amazonaws.com) with 50 VUs for 60s — 30 unauthenticated (anonymous bot simulation) and 20 authenticated (bot with DEVELOPER role, hitting issues/PR/branch endpoints).

    Throughput: 279 req/s

    Pod metrics (idle → under load)

    Pod             CPU idle   CPU load   Memory idle   Memory load
    sztab-backend   2m         370m       443Mi         544Mi
    sztab-db        4m         137m       46Mi          77Mi
    caddy           1m         117m       23Mi          23Mi
    sztabina        1m         1m         2Mi           2Mi

    Observations

    • Unauthenticated requests: 100% returning 403 -- Layer 1 (Spring Security) blocking all anonymous traffic correctly.
    • Authenticated requests: 100% returning 200 -- DEVELOPER role has correct read access.
    • Backend CPU peaks at 370m under load -- this is the baseline to beat after Caddy rate limiting is applied.
    • DB CPU peaks at 137m -- issue/PR list queries are the likely driver.
    • Sztabina unaffected -- git ops not triggered by read-only REST traffic.

    Known limitations

    • Authenticated scenario uses a single shared session cookie across all 20 VUs. Real bot farms distribute load across multiple accounts/sessions. A more realistic simulation would create 5-10 bot accounts and distribute cookies among VUs -- deferred to a later iteration.
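    For that later iteration, distributing cookies across VUs can be as simple as a round-robin helper (a sketch; the cookie values and 5-account setup are hypothetical, and in a real k6 script the 1-based VU id would come from the global __VU):

```typescript
// Round-robin assignment of pre-created bot session cookies to k6 virtual users.
// The VU id is passed in explicitly so the helper stays testable outside k6.
function cookieForVU(vuId: number, cookies: string[]): string {
  if (cookies.length === 0) {
    throw new Error("no bot session cookies configured");
  }
  // k6's __VU is 1-based, so subtract 1 before taking the modulus
  return cookies[(vuId - 1) % cookies.length];
}

// Example: 20 authenticated VUs spread across 5 hypothetical bot accounts
const botCookies = [
  "SESSION=bot1",
  "SESSION=bot2",
  "SESSION=bot3",
  "SESSION=bot4",
  "SESSION=bot5",
];
console.log(cookieForVU(1, botCookies)); // SESSION=bot1
console.log(cookieForVU(6, botCookies)); // SESSION=bot1 (wraps around)
console.log(cookieForVU(7, botCookies)); // SESSION=bot2
```

    Each VU would then send its assigned cookie in the Cookie header instead of the shared session.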

    Next steps

    Implement Layer 2 (Caddy rate limiting) and re-run to measure impact.

  • rk@tigase.net commented 20 hours ago

    Layer 2: Caddy-level rate limiting and bot blocking

    Rejection is now pushed upstream to the reverse proxy, before requests ever reach the JVM. I added two defenses to the Caddyfile:

    • UA blocklist -- known self-identifying AI crawlers (GPTBot, ClaudeBot, CCBot, Bytespider, SemrushBot, AhrefsBot) are rejected with 403 at the proxy edge. Note that this check is easily sidestepped: adversarial scrapers that spoof their user agent will bypass it, which is why rate limiting is the primary defense.

    • Anonymous rate limiting -- unauthenticated traffic is capped at 30 requests/min per IP. Authenticated users (identified by session cookie or API token) are exempt. At 30 r/m, a human browsing casually has ample headroom; a bot hammering endpoints hits the ceiling immediately.

    To support this, I built a custom Caddy image with the rate limiting plugin baked in, pinned to v2.8.4 for reproducibility. The next stress test run will measure how much backend CPU drops as a result.
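    For reference, a minimal build of such an image could look like this (a sketch; the comment above doesn't name the plugin, so the widely used mholt/caddy-ratelimit module is assumed here):

```dockerfile
# Build Caddy 2.8.4 with a rate-limit plugin compiled in
FROM caddy:2.8.4-builder AS builder
RUN xcaddy build --with github.com/mholt/caddy-ratelimit

# Copy the custom binary into the stock runtime image
FROM caddy:2.8.4
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
```

    Pinning both stages to the same Caddy version keeps the plugin build and runtime image reproducible.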

  • rk@tigase.net commented 20 hours ago

    Layer 2 stress test results (Caddy rate limiting, 2026-03-14)

    Setup

    Same test as baseline: 50 VUs for 60s, 30 unauthenticated and 20 authenticated (DEVELOPER role). Rate limiting applied to anonymous traffic only (30 r/min per IP).

    Pod metrics (idle => under load)

    Pod             CPU idle   CPU load   Memory idle   Memory load
    sztab-backend   2m         174m       443Mi         542Mi
    sztab-db        4m         147m       46Mi          77Mi
    caddy           1m         102m       12Mi          17Mi
    sztabina        1m         1m         2Mi           2Mi

    Comparison vs baseline (Layer 1 only)

    Pod             Layer 1   Layer 2   Change
    sztab-backend   370m      174m      -53%
    sztab-db        137m      147m      ~flat (noise)
    caddy           117m      102m      -13%

    Observations

    • Backend CPU dropped by 53% -- anonymous bot traffic is now absorbed by Caddy before requests reach the JVM. The JVM no longer wakes up, allocates objects, or runs the filter chain for unauthenticated requests that exceed the rate limit.
    • DB CPU is flat -- authenticated queries still run as expected. The reduction in backend CPU is entirely from eliminating the unauthenticated filter chain overhead.
    • Caddy CPU is slightly lower too -- the rate limit decision short-circuits before the upstream proxy step, so Caddy does less work per rejected request than it did forwarding 403s from the backend.
    • Memory is stable across both scenarios -- no sign of heap pressure or GC storms under load.

    Next steps

    Layer 3 (robots.txt) and Layer 4 (permission-based access gating) to follow.

Type: New Feature
Priority: Normal
Assignee: (none)
Version: none
Sprints: n/a
Customer: n/a
Issue Votes: 0
Watchers: 3
Reference: SZ-73