Robots.txt Generator
Create and customize robots.txt files to control search engine crawlers and optimize your site's crawl budget.
What is a Robots.txt File?
A robots.txt file is a plain text file in your website's root directory that tells search engine crawlers which parts of your site they may crawl.
How Robots.txt Works
When a search engine bot (like Googlebot, Bingbot, or DuckDuckBot) visits your website, the first file it requests is robots.txt. This file is always located at https://yoursite.com/robots.txt (in the root directory). The bot reads the instructions in this file before crawling any other content on your site.
Basic syntax example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
This tells all bots (*) not to crawl the /admin/ and /private/ directories, except for /admin/public/ which is explicitly allowed. It also points bots to the sitemap for efficient crawling.
What Robots.txt Can and Cannot Do
- Can: Request that well-behaved search engines avoid crawling specific pages or directories
- Can: Prevent your server from being overloaded by excessive crawler requests
- Can: Point search engines to your sitemap for better discovery
- Can: Set different rules for different user-agents (Googlebot vs. Bingbot)
- Cannot: Prevent malicious bots from accessing your site (it's a request, not security)
- Cannot: Remove pages from search results (use meta noindex or password protection for that)
- Cannot: Protect sensitive information (robots.txt is publicly readable)
Important caveat: Robots.txt is a voluntary standard. Reputable search engines (Google, Bing, Yahoo, DuckDuckGo) honor robots.txt directives, but malicious scrapers, email harvesters, and bad actors often ignore it entirely. Never rely on robots.txt for security.
Why Robots.txt Matters for SEO & Crawl Budget
Crawl Budget Optimization
Search engines allocate a limited crawl budget to each website—the number of pages Googlebot or Bingbot will crawl in a given timeframe. For large sites (thousands of pages), crawl budget matters significantly. If bots waste time crawling unimportant pages (thank you pages, admin panels, duplicate content, infinite calendar pages), they may not discover your valuable content.
Impact on large sites: According to Google's Gary Illyes (2023), sites with more than 10,000 pages should actively manage crawl budget. A well-structured robots.txt file can improve crawl efficiency by 30-50% by keeping crawlers away from low-value URLs.
Example: An e-commerce site with 50,000 product pages plus 200,000 filter/sort variations (yoursite.com/products?sort=price&filter=blue&page=5) can use robots.txt to block query parameter URLs, ensuring Googlebot crawls actual products instead of infinite filter combinations.
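A minimal sketch of that approach, assuming the filters live in query parameters such as sort=, filter=, and color= (the parameter names are illustrative):

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=

This keeps crawlers on the clean product and category URLs while skipping the parameterized variations.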
Preventing Duplicate Content Issues
Many websites unintentionally create duplicate content through URL parameters, session IDs, printer-friendly pages, or staging environments. While canonical tags are the primary solution, robots.txt provides an additional layer by preventing crawlers from discovering these duplicates in the first place.
Common duplicate scenarios blocked via robots.txt:
- Disallow: /*?* - Block all URLs with query parameters (use carefully)
- Disallow: /print/ - Block printer-friendly versions
- Disallow: /*sessionid= - Block session ID URLs
- Disallow: /staging/ - Block development/staging areas
Server Load Management
Aggressive crawlers can strain server resources, especially on shared hosting or sites with complex database queries. While major search engines like Google automatically throttle their crawl rate based on server response times, smaller or more aggressive bots may not.
Real-world case: A WordPress blog on shared hosting experienced 70% server CPU spikes from an aggressive marketing crawler. Adding a Crawl-delay: 30 rule for User-agent: SEMrushBot reduced server load by 40% without affecting Google crawling.
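Written out as a rule group, that throttle looks like this (other bots are unaffected, because the rules apply only to the named user-agent):

# Slow down one aggressive crawler
User-agent: SEMrushBot
Crawl-delay: 30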
Keeping Private Pages Private (But Not Secure)
If you have pages that aren't sensitive but shouldn't appear in search results (like internal search result pages, admin login pages, or user account dashboards), robots.txt can request that they not be crawled. However, this is NOT a security measure.
Critical warning: Ironically, listing URLs in robots.txt can advertise their existence to attackers. Many hackers read robots.txt files specifically to find admin panels and sensitive areas. For truly sensitive content, use:
- Server authentication: Password protection (.htaccess, server config)
- Meta robots noindex tag: Prevents indexing but allows crawling
- X-Robots-Tag HTTP header: Server-level indexing control
- Firewalls and IP restrictions: For admin areas
✓ Best for Large Sites
Sites with 10,000+ pages benefit most from robots.txt crawl budget optimization. Small sites (under 1,000 pages) are fully crawled regardless, but still benefit from blocking admin areas and duplicate content.
⚠️ Not a Security Tool
Robots.txt is publicly accessible and relies on voluntary compliance. Malicious bots ignore it. Never use robots.txt to hide sensitive data, API keys, or admin credentials.
📊 Crawl Budget Impact
Blocking low-value pages can increase crawl efficiency by 30-50% on large sites. More crawl budget for important pages = faster discovery of new content = better SEO outcomes.
How This Robots.txt Generator Works
Our Robots.txt Generator provides a user-friendly interface to create properly formatted robots.txt files without memorizing syntax or worrying about typos that could break crawling.
5-Step Generation Process
- Select User-Agent: Choose which bots your rules apply to. Select "All Bots (*)" for universal rules, or specify individual crawlers like Googlebot, Bingbot, or specific scrapers you want to control. You can create multiple rule sets for different user-agents.
- Configure Allow/Disallow Rules: Add paths you want to allow or block. Use wildcards for pattern matching: Disallow: /*.pdf$ blocks all PDF files, Disallow: /*?* blocks URLs with query parameters. The generator validates syntax in real-time.
- Set Crawl Delay (Optional): Specify wait time in seconds between requests for a specific bot. Useful for slowing down aggressive crawlers. Note: Google ignores crawl-delay; adjust Google's rate in Search Console instead.
- Add Sitemap URLs: Include links to your XML sitemaps. This helps search engines discover your content structure efficiently. You can list multiple sitemaps (main sitemap, news sitemap, image sitemap, video sitemap).
- Preview & Download: Review the generated robots.txt code in the preview panel. Click "Copy Code" to paste directly into your website, or "Download File" to save as robots.txt and upload to your server's root directory.
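As an illustration, a configuration that applies to all bots, blocks an admin area with one exception, throttles Bingbot, and lists a single sitemap would produce output along these lines (the paths and domain are placeholders):

User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Crawl-delay: 5

Sitemap: https://yoursite.com/sitemap.xml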
Deployment & Testing
Upload Location: The robots.txt file must be placed in your website's root directory and accessible at https://yoursite.com/robots.txt. Case-sensitive: must be lowercase "robots.txt", not "Robots.txt" or "ROBOTS.TXT".
Testing your robots.txt:
- Direct browser test: Visit yoursite.com/robots.txt to confirm it loads
- Google Search Console: Use the robots.txt Tester (Legacy Tools & Reports section) to test specific URLs against your rules
- Syntax validation: Check for typos, encoding issues (use UTF-8), or unintended blocking
- Monitor Search Console: After deployment, watch for crawl errors or unexpected drops in indexed pages
Common deployment mistake: Uploading robots.txt to a subdirectory like yoursite.com/pages/robots.txt won't work. Bots only check the root directory. Each subdomain needs its own robots.txt (blog.yoursite.com/robots.txt is separate from yoursite.com/robots.txt).
Common Robots.txt Errors & How to Fix Them
Accidentally Blocking Entire Site
What it means: Using Disallow: / tells all crawlers not to crawl ANY pages on your site. This is the most catastrophic robots.txt mistake and can make your site all but disappear from search results within weeks.
How to fix: Remove Disallow: / or replace with specific paths. Correct: Disallow: /admin/. If you want to allow everything, use Disallow: (blank) or omit the Disallow line entirely. After fixing, use Google Search Console to request re-indexing of important pages.
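A before/after sketch of that fix (the /admin/ path is illustrative):

# Before – blocks the entire site
User-agent: *
Disallow: /

# After – blocks only the admin area
User-agent: *
Disallow: /admin/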
Impact: Real case (2023): E-commerce site accidentally deployed Disallow: / to production. Lost 94% organic traffic in 18 days before discovering the error. Took 6 weeks to fully recover rankings.
Blocking CSS, JavaScript, and Images
What it means: Rules like Disallow: /css/, Disallow: /js/, or Disallow: /images/ prevent Googlebot from rendering your pages properly. Google needs access to CSS and JS to understand page layout and mobile-friendliness.
How to fix: Never block CSS, JavaScript, or image directories. Google's robots.txt guidelines warn that blocking these resources harms how its algorithms render and index your pages and can result in worse rankings. Remove any Disallow rules for /css/, /js/, /assets/, /static/, or image folders.
Impact: Pages may be indexed but render poorly in mobile search results. Google's mobile-first indexing requires access to styling. Sites blocking CSS/JS saw 15-30% ranking drops in Google's mobile-first migration (2018-2023).
Wrong File Location or Capitalization
What it means: Robots.txt files not in the root directory (yoursite.com/robots.txt) or with incorrect capitalization (Robots.txt, ROBOTS.TXT) will be ignored by crawlers. Bots only check exactly yoursite.com/robots.txt (lowercase, root directory).
How to fix: Ensure file is at https://yoursite.com/robots.txt (not in /public/, /pages/, or subdirectories). File name must be lowercase: robots.txt. Test by visiting yoursite.com/robots.txt in a browser—it should display your rules as plain text. On Linux servers, filenames are case-sensitive.
Impact: If robots.txt is missing or in the wrong location, crawlers assume full access (the equivalent of User-agent: * with an empty Disallow:). While this isn't harmful for most sites, you lose crawl budget control and may run into duplicate content issues.
Using Relative URLs for Sitemap Directive
What it means: Sitemap URLs in robots.txt must be absolute (full URLs), not relative paths. Sitemap: /sitemap.xml is invalid. Correct format: Sitemap: https://example.com/sitemap.xml.
How to fix: Always include full protocol and domain: Sitemap: https://yoursite.com/sitemap.xml. If you have multiple sitemaps, list each on separate lines. For subdomains, specify full URL: Sitemap: https://blog.yoursite.com/sitemap.xml. Test in Google Search Console under Sitemaps section.
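For instance (the domain is a placeholder):

# Incorrect – relative path, may be ignored
Sitemap: /sitemap.xml

# Correct – absolute URL
Sitemap: https://yoursite.com/sitemap.xml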
Impact: Search engines may not discover or process your sitemap, slowing down indexing of new pages. Particularly harmful for large sites or sites with frequently updated content (news sites, e-commerce with daily product additions).
Syntax Errors (Typos, Spacing, Encoding)
What it means: Extra spaces, tabs, smart quotes, or the wrong encoding can break robots.txt parsing. Common problems include misspelled directives (e.g. Disalow: /admin/), stray spaces such as User-agent : * (space before the colon), and curly quotes (“ ”) pasted in from a word processor instead of straight quotes.
How to fix: Use plain text editor (Notepad++, VS Code, Sublime), NOT Microsoft Word or rich text editors that add hidden formatting. Ensure UTF-8 encoding. Format: Directive: value (directive, colon, space, value). Test with Google Search Console robots.txt Tester to identify parsing errors.
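For reference, a minimal, correctly formatted rule group saved as plain UTF-8 text (the paths are illustrative):

User-agent: *
Disallow: /private/
Allow: /private/docs/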
Impact: Malformed robots.txt may be partially or completely ignored. Crawlers might skip invalid lines, leading to unintended crawling behavior. Silent failures—your site appears fine but crawlers aren't following your rules.
Blocking Pages You Want Indexed
What it means: Overly broad Disallow rules unintentionally block important pages. Example: Disallow: /*? (blocking all query parameters) might block essential e-commerce filter pages or blog pagination if URLs use ?page=2 structure.
How to fix: Use specific Allow rules to create exceptions, as in the sketch below: block query-parameter URLs with Disallow: /*? but carve out pagination with Allow: /*?page=. Test individual important URLs with the Google Search Console robots.txt Tester before deploying. For complex sites, start conservative (block only known-bad paths) rather than applying aggressive blanket blocking.
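The exception pattern written out (this assumes pagination URLs like /blog?page=2; the parameter name is illustrative):

# Block query-parameter URLs, but keep paginated URLs crawlable
Disallow: /*?
Allow: /*?page=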
Impact: Important pages disappear from search results. Real case: News site blocked /*?utm_ to hide tracking URLs, accidentally blocking ?article_id= URLs, causing 62% of article pages to deindex. Took 3 months to recover lost rankings.
Real-World Robots.txt Examples
E-Commerce Site (Large Product Catalog)
Scenario: Online store with 25,000 products but 500,000+ filter/sort combinations (price ranges, colors, sizes, brands). Need to prevent Google from wasting crawl budget on infinite filter variations.
Robots.txt implementation:
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?price_min=
Disallow: /checkout/
Disallow: /cart/
Disallow: /my-account/
Allow: /category/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Result: Blocks crawler access to 480,000+ filter URLs, focusing crawl budget on actual product pages. Also blocks checkout flow and customer accounts (privacy + useless for search). Crawl efficiency improved 67%, new products indexed 3x faster.
WordPress Blog with Plugins
Scenario: Blogger using WordPress with 200 posts but thousands of generated pages from plugins (search results, comment feeds, author archives, date archives). Want to prevent duplicate content indexing.
Robots.txt implementation:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /?s=
Disallow: /search/
Disallow: /author/
Disallow: */feed/
Disallow: */trackback/
Disallow: */attachment/

Sitemap: https://blog.com/sitemap.xml
Result: Blocks WordPress admin, plugin files, theme assets, search results, author pages, and feed URLs. Focuses Google on actual blog posts and pages. Reduced crawled URLs by 78%, focusing on content that matters for search rankings.
News Site (High-Frequency Publishing)
Scenario: News website publishing 50+ articles daily. Need fast indexing of new content, but want to block print versions, AMP cache, and archived content older than 2 years.
Robots.txt implementation:
User-agent: *
Disallow: /print/
Disallow: /archive/2020/
Disallow: /archive/2021/
Disallow: /amp/cache/
Allow: /amp/

User-agent: Googlebot-News
Crawl-delay: 0

User-agent: Bingbot
Crawl-delay: 5

Sitemap: https://news.com/sitemap-news.xml
Sitemap: https://news.com/sitemap-articles.xml
Sitemap: https://news.com/sitemap-videos.xml
Result: Blocks printer-friendly pages and old archives (outdated news). Allows AMP pages but blocks the AMP cache. Sets a 5-second Crawl-delay for Bingbot; the Crawl-delay: 0 line for Googlebot-News is effectively a no-op, since Google ignores the directive and manages its own crawl rate. Multiple sitemaps cover articles, videos, and news. New articles indexed within 15 minutes vs. the previous 2-3 hour average.
SaaS Platform (Public + Private Sections)
Scenario: Software-as-a-Service company with public marketing site and customer login area. Want marketing pages indexed, but not app functionality, customer dashboards, or API documentation.
Robots.txt implementation:
User-agent: *
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/
Disallow: /admin/
Disallow: /login/
Disallow: /signup?*
Allow: /signup$
Allow: /blog/
Allow: /pricing/
Allow: /features/

Sitemap: https://saas.com/sitemap.xml
Result: Blocks all app functionality, customer dashboards, API endpoints, and admin areas. Allows signup page (/signup) but blocks signup with query parameters (/signup?ref=) to prevent tracking URL indexing. Explicitly allows public marketing pages. Note: This is complemented by actual login requirements on /app/ and /dashboard/ (robots.txt alone is NOT security).
Frequently Asked Questions
What happens if I don't have a robots.txt file?
If your website doesn't have a robots.txt file, search engines will assume they have permission to crawl all publicly accessible pages. This is equivalent to having a robots.txt file that allows everything:
User-agent: *
Disallow:
While this isn't necessarily bad, having a robots.txt file gives you more control over how search engines interact with your site, and can help prevent them from wasting resources crawling unimportant pages.
Do all bots and crawlers obey robots.txt?
No, robots.txt is not legally binding and relies on voluntary compliance.
Well-behaved search engines (Google, Bing, Yahoo, etc.) will respect your robots.txt directives. However, malicious bots, scrapers, and hackers often ignore robots.txt files completely. Think of robots.txt as a "polite request" rather than a security measure.
Important: Never use robots.txt to hide sensitive information. Malicious actors can read your robots.txt file to find pages you don't want crawled, making it a roadmap to your sensitive content.
Are there any downsides to having a robots.txt file?
Yes, there are a few potential downsides:
- Public visibility: Your robots.txt file is publicly accessible at yoursite.com/robots.txt. Anyone can see which paths you're blocking, potentially revealing the structure of your site or locations of admin areas.
- Not a security tool: Blocking a path in robots.txt doesn't prevent people from accessing it directly. It only asks search engines not to crawl it.
- Accidental blocking: A misconfigured robots.txt can accidentally block important pages from search engines, hurting your SEO.
- Pages can still be indexed: If other sites link to a blocked page, search engines might still index it (though they won't crawl its content).
Where should I place my robots.txt file?
Your robots.txt file must be placed in the root directory of your website and must be named exactly "robots.txt" (lowercase).
For example:
- ✅ Correct: https://example.com/robots.txt
- ❌ Incorrect: https://example.com/pages/robots.txt
- ❌ Incorrect: https://example.com/Robots.txt
- ❌ Incorrect: https://example.com/robots.TXT
Each subdomain needs its own robots.txt file if you want different rules (e.g., blog.example.com/robots.txt is separate from example.com/robots.txt).
What is the difference between Allow and Disallow?
Disallow: Tells crawlers NOT to access a specific path.
Disallow: /admin/
Allow: Explicitly permits crawlers to access a path (useful for creating exceptions to broader Disallow rules).
Disallow: /admin/
Allow: /admin/public/
In the example above, all of /admin/ is blocked except /admin/public/ which is explicitly allowed.
Can I use robots.txt to hide sensitive or private areas of my site?
Absolutely not! This is a critical security mistake.
Using robots.txt to block sensitive areas has several problems:
- The robots.txt file is publicly readable, so you're advertising where your sensitive content is located
- Malicious bots don't respect robots.txt and will specifically target blocked areas
- Direct links to blocked pages still work - robots.txt doesn't prevent access
Correct approach: Use proper authentication, passwords, server configuration, or firewalls to protect sensitive content.
What is a user-agent?
A user-agent is the name/identifier of the bot or crawler. Different search engines use different user-agent names:
- Googlebot - Google's web crawler
- Bingbot - Microsoft Bing's crawler
- Slurp - Yahoo's crawler
- DuckDuckBot - DuckDuckGo's crawler
- * (asterisk) - Matches all bots
You can specify different rules for different crawlers. For example, you might want to slow down a particularly aggressive crawler with a crawl-delay while allowing others full access.
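A minimal sketch of that setup (the bot name and delay value are illustrative):

# All other bots: full access
User-agent: *
Disallow:

# One aggressive crawler: slow it down
User-agent: ExampleAggressiveBot
Crawl-delay: 10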
What does the Crawl-delay directive do?
Crawl-delay specifies the number of seconds a crawler should wait between requests to your server. For example:
Crawl-delay: 10
This tells the bot to wait 10 seconds between each page request.
When to use it:
- Your server is struggling with bot traffic
- You want to reduce server load from crawlers
- A specific bot is being too aggressive
Note: Google ignores Crawl-delay. Instead, use Google Search Console to adjust Googlebot's crawl rate. Bing and other search engines do respect this directive.
Can I use wildcards in my robots.txt file?
Yes! Most modern crawlers support wildcard patterns:
- * (asterisk) - Matches any sequence of characters
- $ (dollar sign) - Matches the end of a URL
Examples:
# Block all URLs with "?" (query parameters)
Disallow: /*?

# Block all .pdf files
Disallow: /*.pdf$

# Block all URLs containing "admin"
Disallow: /*admin
Note: Wildcards are supported by Google, Bing, and most modern crawlers, but very old or simple bots may not understand them.
How do I test my robots.txt file?
You can test your robots.txt file in several ways:
- Direct access: Visit yoursite.com/robots.txt in a browser to see if it loads correctly
- Google Search Console: Use the robots.txt Tester tool (under Legacy tools & reports)
- Online validators: Use free robots.txt testing tools to check syntax
- Check formatting: Make sure there are no extra spaces, special characters, or encoding issues
Common testing mistakes:
- Using a text editor that adds hidden characters or smart quotes
- Saving with the wrong encoding (use UTF-8)
- Having a typo in the filename (must be exactly "robots.txt")
Should I add my sitemap to robots.txt?
Yes, it's highly recommended! Adding your sitemap URL to robots.txt helps search engines discover and crawl your site more efficiently.
Sitemap: https://example.com/sitemap.xml
Benefits:
- Helps search engines find all your important pages
- Provides a central reference for your site structure
- Can specify multiple sitemaps if needed
- No downside - it only helps search engines
You can list multiple sitemaps:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-images.xml
What are the most common robots.txt mistakes?
Here are the most common and costly robots.txt mistakes:
- Blocking your entire site by accident:
  # DON'T DO THIS (unless you really want to block everything)
  User-agent: *
  Disallow: /
- Blocking CSS and JavaScript - Google needs these to render your pages properly. Don't block /css/ or /js/ folders.
- Using it as a security measure - As mentioned, robots.txt is not for security.
- Incorrect file location - Must be in root directory, not in a subdirectory.
- Case sensitivity errors - The filename must be lowercase "robots.txt", but paths can be case-sensitive depending on your server.
- Forgetting to update after site changes - Review your robots.txt when you restructure your site.
- Using relative URLs - Sitemap URLs must be absolute (include full domain).