Test Robots.txt Rules

Enter a URL and user-agent to check if it's allowed or blocked by the website's robots.txt file.

How It Works

This tool fetches the robots.txt file from the domain, parses the rules for your chosen user-agent, and determines whether the specific URL path is allowed or blocked.

  • Enter the full URL you want to test
  • Select the bot user-agent to test

What is a Robots.txt Tester?

A robots.txt tester is a tool that validates whether specific URLs on your website are allowed or blocked by your robots.txt file rules. It simulates how search engine crawlers interpret your robots.txt directives, helping you verify that important pages aren't accidentally blocked and unwanted pages are properly restricted.

Why Testing Robots.txt Matters

A single misplaced character or incorrect path in robots.txt can have catastrophic consequences. In 2022, a Fortune 500 e-commerce site accidentally deployed Disallow: / instead of Disallow: /admin/ to production, blocking their entire site from Google. They lost 92% of organic traffic within 14 days before discovering the error.

Common scenarios requiring testing:

  • Before deploying new robots.txt: Test all critical URLs (homepage, top products, key landing pages) to ensure they remain crawlable
  • After site restructuring: URL paths change during migrations - verify robots.txt rules still work correctly
  • Debugging crawl issues: When pages disappear from search results, test if robots.txt is blocking them
  • Complex wildcard rules: Patterns like Disallow: /*? or Disallow: /*.pdf$ can have unintended side effects
  • Multi-user-agent configurations: Different rules for Googlebot vs. Bingbot - ensure they work as intended

How Robots.txt Testing Works

The tester parses your robots.txt file following the Robots Exclusion Protocol (REP) standards. It reads User-agent, Disallow, and Allow directives in order, applying rules with the most specific matching pattern. More specific rules override general rules (longest matching path wins).

Example rule precedence:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Testing /admin/login/ → Blocked (matches /admin/)
Testing /admin/public/docs.html → Allowed (Allow: /admin/public/ is more specific)
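To make the precedence rule concrete, here is a minimal Python sketch of longest-match evaluation for plain path prefixes (no wildcards). It illustrates the principle rather than this tool's actual implementation; the check_path helper and the rules list are hypothetical.

def check_path(path, rules):
    """Return 'Allowed' or 'Blocked' using longest-match precedence.

    rules is a list of (directive, pattern) tuples from one User-agent block,
    e.g. [("Disallow", "/admin/"), ("Allow", "/admin/public/")].
    Only plain prefix patterns are handled here (no * or $ wildcards).
    """
    best_len, allowed = -1, True  # no matching rule means the URL is allowed
    for directive, pattern in rules:
        if pattern and path.startswith(pattern) and len(pattern) > best_len:
            best_len = len(pattern)
            allowed = directive.lower() == "allow"
    return "Allowed" if allowed else "Blocked"

rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
print(check_path("/admin/login/", rules))            # Blocked (matches /admin/)
print(check_path("/admin/public/docs.html", rules))  # Allowed (longer Allow wins)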

Why Robots.txt Testing Matters for SEO

Preventing Accidental De-Indexing

The most catastrophic robots.txt mistake is accidentally blocking important pages. Google Search Console data (2023) shows that 18% of all "Index Coverage" errors are caused by misconfigured robots.txt files blocking valuable content.

Real consequences: An online retailer blocked /*? (all URLs with query parameters) to prevent duplicate content indexing. Unfortunately, their product pages used the /products?id=12345 structure. Result: 62,000 product pages deindexed, causing a 73% drop in product search traffic. It took 4 months to recover the lost rankings.

Wildcard Pattern Validation

Modern robots.txt supports wildcards (* for any sequence, $ for URL end), but these can behave unexpectedly. Testing is essential before deployment.

Tricky wildcard examples:

  • Disallow: /*admin* - Blocks ANY URL containing "admin" anywhere (including /administrator/, /badminton/, /admin-panel/)
  • Disallow: /*.pdf$ - Blocks only URLs ending in .pdf; without the $, /*.pdf would also block /brochure.pdf.html
  • Disallow: /*?* - Blocks all query parameters, but may accidentally block pagination like ?page=2
  • Disallow: /category/* - Blocks /category/products/ but NOT /categories/products/ (the /category/ prefix must match exactly)
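A practical way to reason about these patterns is to translate them into regular expressions, which is roughly what spec-compliant parsers do internally. The sketch below is a simplified illustration under that assumption, not a full Robots Exclusion Protocol implementation; the robots_pattern_to_regex helper is hypothetical.

import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex: '*' matches any
    sequence, '$' anchors the end of the path, everything else is literal.
    The result is anchored at the start of the path (prefix semantics)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts))

cases = {
    "/*admin*": ["/administrator/", "/badminton/", "/admin-panel/", "/about/"],
    "/*.pdf$":  ["/brochure.pdf", "/brochure.pdf.html"],
    "/*.pdf":   ["/brochure.pdf", "/brochure.pdf.html"],
}
for pattern, paths in cases.items():
    rx = robots_pattern_to_regex(pattern)
    for path in paths:
        print(f"{pattern:10s} {path:20s} {'Blocked' if rx.match(path) else 'Allowed'}")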

Google Search Console vs. Third-Party Testers

Google Search Console robots.txt Tester (under Legacy Tools & Reports) shows exactly how Googlebot interprets your rules. However, it only tests Googlebot - not Bingbot, DuckDuckBot, or other crawlers.

When to use each:

  • Google Search Console: Final validation for production robots.txt, official Googlebot behavior, testing live robots.txt on your domain
  • Third-party testers (like this tool): Testing draft robots.txt before deployment, testing multiple user-agents, quick validation without Search Console access

🚨 Critical: Test Before Deploy

Never deploy robots.txt to production untested. 18% of Google Index Coverage errors stem from robots.txt mistakes. Test critical URLs (homepage, top products, key landing pages) before going live.

✓ Longest Match Wins

When multiple rules match a URL, robots.txt uses the longest (most specific) matching pattern. Disallow: /admin/ + Allow: /admin/public/ = /admin/public/docs.html is allowed (more specific rule).

⚠️ Case Sensitivity Varies

Robots.txt pattern matching is case-sensitive, so Disallow: /Admin/ does not block /admin/. Whether /Admin/ and /admin/ actually serve the same page depends on your server (Linux: typically case-sensitive, Windows: case-insensitive) - test both variations.

How This Robots.txt Tester Works

Our tester simulates how search engine crawlers parse and apply robots.txt rules to specific URLs. Here's the technical process:

4-Step Testing Process

  1. Fetch Robots.txt: The tool retrieves robots.txt from the domain you specify (always at https://yourdomain.com/robots.txt). If no file exists, it assumes all URLs are allowed (equivalent to empty robots.txt).
  2. Parse Directives: The robots.txt file is parsed line-by-line, identifying User-agent blocks and their associated Disallow and Allow rules. Comments (#) and invalid lines are ignored.
  3. Select User-Agent Rules: You choose which bot to simulate (Googlebot, Bingbot, * for all bots). The tool applies rules for the most specific matching user-agent. If you test as "Googlebot," it looks for User-agent: Googlebot first, falling back to User-agent: * if no Googlebot-specific rules exist.
  4. Apply Pattern Matching: Your test URL is compared against all Disallow and Allow patterns for the selected user-agent. The longest matching pattern determines the result. Wildcards (*, $) are expanded during matching.
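If you want to script this flow yourself, Python's standard library ships a basic parser, urllib.robotparser. The sketch below follows the same four steps against a placeholder domain (example.com); note that this parser's matching is simpler than Googlebot's (it does not support * or $ wildcards), so treat it as an approximation for complex rules.

from urllib.robotparser import RobotFileParser

# Steps 1-2: fetch and parse robots.txt for the target domain (placeholder URL)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Steps 3-4: pick a user-agent and test specific URLs against the parsed rules
for url in ["https://example.com/", "https://example.com/admin/login/"]:
    verdict = "Allowed" if rp.can_fetch("Googlebot", url) else "Blocked"
    print(f"{verdict}: {url}")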

Understanding Test Results

Allowed: The URL can be crawled by the selected bot. Either no rules match the URL, or an Allow rule matched with higher specificity than any Disallow rule.

Blocked: The URL is blocked from crawling. A Disallow rule matched the URL, and no more specific Allow rule overrides it.

Example testing scenario:

User-agent: *
Disallow: /*.pdf$
Disallow: /private/
Allow: /private/public/

Test results:
  • /products/manual.pdf → Blocked (matches /*.pdf$)
  • /brochure.pdf.html → Allowed ($ requires .pdf at the end, not in the middle)
  • /private/admin/ → Blocked (matches /private/)
  • /private/public/docs.html → Allowed (/private/public/ is more specific)

Best Practices for Testing

  • Test critical URLs first: Homepage (/), top products, key category pages, important blog posts
  • Test edge cases: URLs with query parameters (?page=2), trailing slashes (/products/ vs /products), case variations
  • Test both user-agents: If you have different rules for Googlebot and Bingbot, test URLs with both agents
  • Test after every robots.txt change: Even minor edits can have unintended consequences
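One low-effort way to follow these practices is to keep a small script with your critical URLs and rerun it after every robots.txt change. This sketch reuses Python's standard-library parser with placeholder URLs and user-agents (swap in your own); the same wildcard caveat as above applies.

from urllib.robotparser import RobotFileParser

CRITICAL_URLS = [                      # placeholder list - use your own pages
    "https://example.com/",
    "https://example.com/products/",
    "https://example.com/products/?page=2",
    "https://example.com/blog/top-post/",
]
USER_AGENTS = ["Googlebot", "Bingbot", "*"]

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for agent in USER_AGENTS:
    for url in CRITICAL_URLS:
        status = "Allowed" if rp.can_fetch(agent, url) else "BLOCKED"
        print(f"{agent:10} {status:8} {url}")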

Common Robots.txt Testing Errors & How to Fix Them

These are the most common mistakes discovered during robots.txt testing, along with their solutions:

Testing Shows Homepage Blocked

What it means: Your root URL (/) shows as "Blocked" when testing. This is almost always a critical error - it means Disallow: / exists in your robots.txt, blocking your entire site from search engines.

How to fix: Remove Disallow: / from your robots.txt unless you intentionally want to block everything. If you meant to block only the /admin/ directory, use Disallow: /admin/ (note the /admin/ path). After fixing, re-test homepage and all critical pages to confirm they're allowed.

Impact: Entire site deindexed from Google within 2-4 weeks. Traffic drops to near-zero. Real case (2023): E-commerce site lost $4.2M in quarterly revenue due to accidental 6-week Disallow: / deployment.

Wildcard Blocking More Than Intended

What it means: A pattern like Disallow: /*admin* intended to block /admin/ actually blocks /administrator/, /badminton/, /leadership-admin/ - any URL containing "admin" anywhere. Testing reveals unexpected blocked URLs.

How to fix: Use specific paths instead of wildcards when possible. Replace Disallow: /*admin* with Disallow: /admin/. If you must block multiple admin-related paths, list them explicitly: Disallow: /admin/, Disallow: /administrator/, Disallow: /wp-admin/. Test each pattern individually.

Impact: Important pages accidentally blocked. News site using Disallow: /*print* to block printer-friendly pages also blocked /sprint-results/, /fingerprinting-guide/, /blueprint-downloads/ - lost 8% of indexable content.

Allow Rule Not Overriding Disallow

What it means: You have Disallow: /admin/ and Allow: /admin/public/, but testing shows /admin/public/docs.html as "Blocked." This usually happens when the Allow rule sits in a different User-agent block than the Disallow, contains a typo, or isn't actually longer (more specific) than the Disallow pattern; some simpler crawlers also ignore Allow directives entirely.

How to fix: Ensure the Allow and Disallow rules are in the same User-agent block and that the Allow pattern is the longer (more specific) one. The longest matching pattern wins regardless of order; when patterns are equally specific, Google applies the least restrictive rule (the Allow). Best practice: Group related Allow/Disallow pairs together for clarity. A working configuration:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Impact: Public admin resources (public documentation, open API endpoints) remain blocked when they should be crawlable. Reduces discoverability of legitimate public content.

Case Sensitivity Inconsistency

What it means: Disallow: /Admin/ blocks /Admin/login/ but NOT /admin/login/ when tested, because robots.txt pattern matching is case-sensitive. Whether /Admin/ and /admin/ are actually different pages depends on your server: on Linux they are usually distinct paths, while on Windows they typically resolve to the same content.

How to fix: Always use lowercase paths in robots.txt to match URL conventions (most sites use lowercase). Use Disallow: /admin/ not Disallow: /Admin/. If your site has mixed-case URLs, test both variations. For comprehensive blocking, list both: Disallow: /admin/ and Disallow: /Admin/.

Impact: Inconsistent crawling behavior across different servers or URL structures. /admin/ might be crawled while /Admin/ is blocked, or vice versa, causing duplicate content or security exposure.

Query Parameters Blocking Pagination

What it means: Rule Disallow: /*? (block all query parameters) intended to prevent duplicate content also blocks pagination URLs like /products?page=2, category filters /shop?category=shoes, or search results /search?q=term.

How to fix: Use specific query parameter blocking instead of a blanket wildcard. Replace Disallow: /*? with specific parameters: Disallow: /*?utm_, Disallow: /*?sessionid=, Disallow: /*?sort=. Or use Allow exceptions: Disallow: /*? combined with Allow: /*?page= (blocks all query params except pagination).
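As a quick sanity check on that pagination exception, note that /*?page= is a longer pattern than /*?, so under longest-match precedence the Allow wins for paginated URLs. A minimal check, assuming the simplified star-to-regex translation used earlier:

import re

allow, disallow = "/*?page=", "/*?"
url_path = "/products?page=2"

def matches(pattern, path):
    # '*' becomes '.*'; everything else (including '?') is literal;
    # the pattern is anchored at the start of the path
    return re.match("^" + re.escape(pattern).replace(r"\*", ".*"), path) is not None

print(matches(disallow, url_path), matches(allow, url_path))  # True True - both match
print(len(allow) > len(disallow))  # True - the longer Allow pattern takes precedence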

Impact: Paginated content (page 2, 3, 4...) deindexed. E-commerce filters blocked. Search result pages gone. Can cause 40-60% of site pages to disappear from Google for large catalog sites.

Forgetting Trailing Slash Behavior

What it means: Disallow: /admin (no trailing slash) blocks /admin and /admin/, but ALSO /administrator, /admin.html, and /admin-panel/. Disallow: /admin/ (with trailing slash) only blocks /admin/ and its subdirectories, NOT /admin or /administrator.

How to fix: Always use trailing slashes for directory blocking: Disallow: /admin/ not Disallow: /admin. Test both /admin and /admin/ to see the difference. The trailing slash makes the rule more predictable and prevents over-blocking.
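Because Disallow rules are prefix matches, the difference is easy to demonstrate with plain startswith checks. This tiny sketch (wildcards aside) mirrors that behavior for the paths mentioned above:

for rule in ("/admin", "/admin/"):
    for path in ("/admin", "/admin/", "/administrator/", "/admin.html", "/admin-panel/"):
        blocked = path.startswith(rule)  # robots.txt Disallow is a prefix match
        print(f"Disallow: {rule:8} {path:16} {'Blocked' if blocked else 'Allowed'}")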

Impact: Without trailing slash, Disallow: /admin over-blocks unintended URLs. With trailing slash, behavior is more precise but requires testing both /admin and /admin/ separately.

Real-World Robots.txt Testing Scenarios

See how robots.txt testing catches critical errors before they cause SEO damage:

E-Commerce Site Migration Testing

Scenario: Online retailer migrating from /products.php?id=123 to /products/product-name-123 URL structure. Old robots.txt blocked /*?* to prevent query parameter duplicates.

Testing revealed: New URL structure includes size/color filters: /products/shoes?size=10&color=black. The /*?* rule would block all filtered product pages.

Solution:

User-agent: *
# Block old query param structure
Disallow: /products.php?
# Allow new filter parameters
Allow: /products/*?
# Block tracking parameters
Disallow: /*?utm_
Disallow: /*?sessionid=

Result: Testing confirmed filtered product pages (4,800 important URLs) remained crawlable. Saved potential 62% traffic loss from accidentally blocking filtered URLs.

WordPress Plugin Conflict Detection

Scenario: Marketing blog installed a new security plugin that auto-generated robots.txt rules. Site traffic dropped 41% within 3 weeks.

Testing revealed: Plugin added Disallow: /wp-content/ to block WordPress core files, but this also blocked /wp-content/uploads/ containing all blog images and featured images for posts.

Generated robots.txt:

# Plugin-generated (PROBLEMATIC)
User-agent: *
Disallow: /wp-content/  # Blocks uploads!

Fixed robots.txt after testing:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-content/uploads/  # Explicitly allow images
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

Result: Testing each critical URL path (images, blog posts, category pages) before deployment prevented 41% traffic loss from persisting.

SaaS Application Public/Private Testing

Scenario: SaaS company with public marketing site and private customer dashboard at /app/. Wanted to block app but allow public pages.

Initial robots.txt (untested):

User-agent: *
Disallow: /app

Testing revealed problems:

  • /app → Blocked ✓ (intended)
  • /app/ → Blocked ✓ (intended)
  • /apply-now → Blocked ✗ (unintended!) - Important lead capture page
  • /app-features → Blocked ✗ (unintended!) - Key marketing page

Fixed robots.txt:

User-agent: *
Disallow: /app/  # Trailing slash prevents /apply-now blocking

Result: Trailing slash prevented blocking 2 critical marketing pages that started with "app". Testing saved 23% of top-of-funnel traffic.

Multi-Language Site User-Agent Testing

Scenario: International site with different crawler rules for different search engines. Googlebot should crawl all languages, but Yandex only Russian pages.

Robots.txt:

User-agent: Googlebot
Disallow:

User-agent: Yandex
Disallow: /en/
Disallow: /de/
Disallow: /fr/
Allow: /ru/

User-agent: *
Disallow:

Testing matrix:

  • /en/products → Googlebot: Allowed ✓, Yandex: Blocked ✗, Bingbot: Allowed ✓
  • /ru/produkty → Googlebot: Allowed ✓, Yandex: Allowed ✓, Bingbot: Allowed ✓

Result: Testing confirmed Yandex correctly blocked non-Russian pages while Google/Bing crawled all languages. Proper geo-targeting without duplicate content penalties.
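To double-check a matrix like this before deployment, the draft robots.txt content can be fed directly into Python's standard-library parser; for simple prefix rules like these it agrees with the expected results (wildcard rules would need a more capable parser).

from urllib.robotparser import RobotFileParser

draft = """\
User-agent: Googlebot
Disallow:

User-agent: Yandex
Disallow: /en/
Disallow: /de/
Disallow: /fr/
Allow: /ru/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(draft.splitlines())   # parse draft content without fetching anything

for agent in ("Googlebot", "Yandex", "Bingbot"):
    for path in ("/en/products", "/ru/produkty"):
        verdict = "Allowed" if rp.can_fetch(agent, path) else "Blocked"
        print(f"{agent:10} {path:14} {verdict}")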

Frequently Asked Questions

Everything you need to know about testing robots.txt files

What is a robots.txt tester and why do I need one?

A robots.txt tester validates whether specific URLs are allowed or blocked by your robots.txt file rules. It simulates how search engine crawlers interpret your directives, helping you verify that important pages aren't accidentally blocked.

Why you need it: A single misconfigured robots.txt can cause catastrophic SEO damage. In 2022, a Fortune 500 site accidentally deployed Disallow: / instead of Disallow: /admin/, losing 92% of organic traffic within 14 days. Google Search Console data shows 18% of all Index Coverage errors stem from robots.txt mistakes.

Key use cases: (1) Before deploying new robots.txt to production, (2) After site restructuring or URL changes, (3) Debugging why pages disappeared from search results, (4) Validating complex wildcard rules like Disallow: /*?, (5) Testing multi-user-agent configurations (different rules for Googlebot vs. Bingbot).

Test before deploy: Always test critical URLs (homepage, top products, key landing pages) before making robots.txt live. Testing takes 30 seconds, recovering from accidental de-indexing takes 4-8 weeks.

How does robots.txt determine whether a URL is allowed or blocked?

Robots.txt uses pattern matching to determine if a URL is allowed or blocked. The key principle: longest (most specific) matching pattern wins.

Pattern matching rules:

  • Exact prefix match: Disallow: /admin/ blocks /admin/login/, /admin/users/, etc.
  • Wildcard *: Matches any sequence of characters. Disallow: /*.pdf$ blocks all PDFs
  • End anchor $: Matches end of URL. Disallow: /*.pdf$ blocks /doc.pdf but NOT /doc.pdf.html
  • Case sensitivity: Pattern matching is case-sensitive; whether /Admin/ and /admin/ serve the same page depends on your server (Linux: typically case-sensitive, Windows: case-insensitive)

Precedence example: If you have Disallow: /admin/ and Allow: /admin/public/, testing /admin/public/docs.html shows "Allowed" because /admin/public/ is more specific (longer) than /admin/.

Common mistake: Disallow: /*admin* blocks ANY URL containing "admin" (including /badminton/, /administrator/, /leadership-admin/), often unintentionally. Test thoroughly before using wildcards.
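Putting the two ideas together - wildcard expansion plus longest-match precedence - a simplified spec-style check looks like the sketch below. It assumes precedence by pattern length with Allow winning ties (Google's documented tie-break); the helper names are hypothetical.

import re

def to_regex(pattern):
    # '*' -> any sequence, '$' -> end anchor, everything else literal
    parts = [".*" if c == "*" else "$" if c == "$" else re.escape(c) for c in pattern]
    return re.compile("^" + "".join(parts))

def is_allowed(path, rules):
    """rules: list of (directive, pattern) pairs for one user-agent group."""
    matched = [(len(p), d) for d, p in rules if p and to_regex(p).match(path)]
    if not matched:
        return True                       # no matching rule means allowed
    longest = max(length for length, _ in matched)
    # among rules of the winning length, Allow beats Disallow
    return any(d.lower() == "allow" for length, d in matched if length == longest)

rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/"), ("Disallow", "/*.pdf$")]
for path in ("/admin/public/docs.html", "/admin/login/", "/files/report.pdf"):
    print(path, "->", "Allowed" if is_allowed(path, rules) else "Blocked")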

How is this tool different from Google Search Console's robots.txt Tester?

Both tools serve similar purposes but have different use cases:

Google Search Console robots.txt Tester (Legacy Tools & Reports):

  • Tests exactly how Googlebot interprets your live robots.txt
  • Official Google tool - shows authoritative Googlebot behavior
  • Only tests Googlebot (not Bingbot, Yandex, or other crawlers)
  • Requires Search Console verification and access
  • Tests production robots.txt on your domain only

This third-party tester:

  • Tests any domain's robots.txt (no ownership required)
  • Can test multiple user-agents (Googlebot, Bingbot, DuckDuckBot, etc.)
  • Test draft robots.txt before deployment (paste content)
  • Faster for quick validation without Search Console login
  • Educational - shows matching rules and explains results

Best practice: Use this tool for pre-deployment testing and multi-agent validation. Use Google Search Console for final production verification and official Googlebot behavior confirmation.

Why does testing show "Allowed" but my page still isn't indexed?

If robots.txt testing shows "Allowed" but your page isn't indexed, robots.txt is NOT the problem. Other common causes:

1. Meta robots noindex tag: Check your HTML for <meta name="robots" content="noindex">. This tells search engines not to index the page even though crawling is allowed. Common with WordPress plugins, staging sites, or development environments.

2. X-Robots-Tag HTTP header: Server may send X-Robots-Tag: noindex header. Use browser DevTools (Network tab) or curl to check: curl -I https://yoursite.com/page.

3. Canonical tag pointing elsewhere: <link rel="canonical" href="https://otherpage.com/"> tells Google this page is a duplicate. Google may index the canonical version instead.

4. Low page quality or thin content: Google may choose not to index low-value pages even if crawling is allowed.

5. Recent deployment: New pages take 1-7 days for Googlebot to crawl, 2-4 weeks to index. Use Google Search Console → URL Inspection → Request Indexing to speed up.

Debugging steps: (1) Google Search Console → URL Inspection for specific page status, (2) Check page source for meta robots tag, (3) Check canonical tag, (4) Verify page has substantial content (200+ words minimum).

How do I test wildcard patterns safely?

Wildcard patterns (* and $) can have unintended side effects. Here's how to test systematically:

Step 1: Test exact intended targets - If Disallow: /*.pdf$ should block PDFs, test /brochure.pdf, /docs/manual.pdf - should show "Blocked".

Step 2: Test edge cases - Test /brochure.pdf.html (should be Allowed since $ requires .pdf at end), /pdfs/ (should be Allowed, no .pdf extension), /PDF-guide.pdf (check case sensitivity).

Step 3: Test common false positives - For Disallow: /*admin*, test /administrator/, /badminton/, /madministrator/, /admin-panel/ to see if any are unintentionally blocked.

Step 4: Test query parameters carefully - Disallow: /*? blocks ALL query params. Test /products?page=2 (pagination), /search?q=term (search), /shop?category=shoes (filters) to ensure important functionality isn't blocked.

Best practice: Start with specific paths, not wildcards. Only use wildcards after testing confirms they don't over-block. Example: Instead of Disallow: /*admin*, use Disallow: /admin/, Disallow: /wp-admin/, Disallow: /administrator/ for precise control.

Can I test how different user-agents interpret my robots.txt?

Yes! You can test the same URL against different user-agents to see how each search engine's crawler would interpret your robots.txt rules.

Common user-agent testing scenarios:

1. Different rules for different bots: If you have Googlebot-specific rules and general rules for other bots, test URLs with both "Googlebot" and "*" (all bots) to ensure correct behavior. Example: You might allow Googlebot full access but rate-limit aggressive scrapers.

2. Geo-targeting: International sites sometimes block specific search engines from certain language versions. Test Yandex vs. Google to verify Russian pages are accessible to Yandex while other languages might not be.

3. Crawl-delay differences: While testing doesn't show crawl-delay values directly, you can verify which user-agents have crawl-delay directives and which have unrestricted access.

Important note: While robots.txt syntax is standardized (REP - Robots Exclusion Protocol), minor interpretation differences exist. Google and Bing follow the standard closely, but obscure crawlers may not support wildcards or may ignore Allow directives entirely.

Testing workflow: Test critical URLs with "Googlebot" first (covers 90%+ of search traffic), then test with "Bingbot" (covers most remaining traffic), finally test with "*" (catches any bots without specific rules).