Robots.txt Generator
Create and customize robots.txt files to control search engine crawlers and optimize your site's crawl budget.
What is a Robots.txt File?
A robots.txt file is a plain text file in your website's root directory that tells search engine crawlers which parts of your site they may crawl.
How Robots.txt Works
When a search engine bot (like Googlebot, Bingbot, or DuckDuckBot) visits your website, the first file it requests is robots.txt. This file is always located at https://yoursite.com/robots.txt (in the root directory). The bot reads the instructions in this file before crawling any other content on your site.
Basic syntax example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
This tells all bots (*) not to crawl the /admin/ and /private/ directories, except for /admin/public/ which is explicitly allowed. It also points bots to the sitemap for efficient crawling.
What Robots.txt Can and Cannot Do
- Can: Request that well-behaved search engines avoid crawling specific pages or directories
- Can: Prevent your server from being overloaded by excessive crawler requests
- Can: Point search engines to your sitemap for better discovery
- Can: Set different rules for different user-agents (Googlebot vs. Bingbot)
- Cannot: Prevent malicious bots from accessing your site (it's a request, not security)
- Cannot: Remove pages from search results (use meta noindex or password protection for that)
- Cannot: Protect sensitive information (robots.txt is publicly readable)
Important caveat: Robots.txt is a voluntary standard. Reputable search engines (Google, Bing, Yahoo, DuckDuckGo) honor robots.txt directives, but malicious scrapers, email harvesters, and bad actors often ignore it entirely. Never rely on robots.txt for security.
Why Robots.txt Matters for SEO & Crawl Budget
Crawl Budget Optimization
Search engines allocate a limited crawl budget to each website—the number of pages Googlebot or Bingbot will crawl in a given timeframe. For large sites (thousands of pages), crawl budget matters significantly. If bots waste time crawling unimportant pages (thank you pages, admin panels, duplicate content, infinite calendar pages), they may not discover your valuable content.
Impact on large sites: According to Google's Gary Illyes (2023), sites with more than 10,000 pages should actively manage crawl budget. A well-structured robots.txt file can improve crawl efficiency by 30-50% by keeping crawlers away from low-value URLs.
Example: An e-commerce site with 50,000 product pages plus 200,000 filter/sort variations (yoursite.com/products?sort=price&filter=blue&page=5) can use robots.txt to block query parameter URLs, ensuring Googlebot crawls actual products instead of infinite filter combinations.
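A minimal sketch of that approach, assuming the filters live in query parameters such as sort=, filter=, and color= (the parameter names are illustrative):

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=

This keeps crawlers on the clean product and category URLs while skipping the parameterized variations.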
Preventing Duplicate Content Issues
Many websites unintentionally create duplicate content through URL parameters, session IDs, printer-friendly pages, or staging environments. While canonical tags are the primary solution, robots.txt provides an additional layer by preventing crawlers from discovering these duplicates in the first place.
Common duplicate scenarios blocked via robots.txt:
- Disallow: /*?* - Block all URLs with query parameters (use carefully)
- Disallow: /print/ - Block printer-friendly versions
- Disallow: /*sessionid= - Block session ID URLs
- Disallow: /staging/ - Block development/staging areas
Server Load Management
Aggressive crawlers can strain server resources, especially on shared hosting or sites with complex database queries. While major search engines like Google automatically throttle their crawl rate based on server response times, smaller or more aggressive bots may not.
Real-world case: A WordPress blog on shared hosting experienced 70% server CPU spikes from an aggressive marketing crawler. Adding a Crawl-delay: 30 rule for User-agent: SEMrushBot reduced server load by 40% without affecting Google crawling.
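Written out as a rule group, that throttle looks like this (other bots are unaffected, because the rules apply only to the named user-agent):

# Slow down one aggressive crawler
User-agent: SEMrushBot
Crawl-delay: 30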
Keeping Private Pages Private (But Not Secure)
If you have pages that aren't sensitive but shouldn't appear in search results (like internal search result pages, admin login pages, or user account dashboards), robots.txt can request that they not be crawled. However, this is NOT a security measure.
Critical warning: Ironically, listing URLs in robots.txt can advertise their existence to attackers. Many hackers read robots.txt files specifically to find admin panels and sensitive areas. For truly sensitive content, use:
- Server authentication: Password protection (.htaccess, server config)
- Meta robots noindex tag: Prevents indexing but allows crawling
- X-Robots-Tag HTTP header: Server-level indexing control
- Firewalls and IP restrictions: For admin areas
✓ Best for Large Sites
Sites with 10,000+ pages benefit most from robots.txt crawl budget optimization. Small sites (under 1,000 pages) are fully crawled regardless, but still benefit from blocking admin areas and duplicate content.
⚠️ Not a Security Tool
Robots.txt is publicly accessible and relies on voluntary compliance. Malicious bots ignore it. Never use robots.txt to hide sensitive data, API keys, or admin credentials.
📊 Crawl Budget Impact
Blocking low-value pages can increase crawl efficiency by 30-50% on large sites. More crawl budget for important pages = faster discovery of new content = better SEO outcomes.
How This Robots.txt Generator Works
Our Robots.txt Generator provides a user-friendly interface to create properly formatted robots.txt files without memorizing syntax or worrying about typos that could break crawling.
5-Step Generation Process
- Select User-Agent: Choose which bots your rules apply to. Select "All Bots (*)" for universal rules, or specify individual crawlers like Googlebot, Bingbot, or specific scrapers you want to control. You can create multiple rule sets for different user-agents.
- Configure Allow/Disallow Rules: Add paths you want to allow or block. Use wildcards for pattern matching: Disallow: /*.pdf$ blocks all PDF files, Disallow: /*?* blocks URLs with query parameters. The generator validates syntax in real-time.
- Set Crawl Delay (Optional): Specify wait time in seconds between requests for a specific bot. Useful for slowing down aggressive crawlers. Note: Google ignores crawl-delay; adjust Google's rate in Search Console instead.
- Add Sitemap URLs: Include links to your XML sitemaps. This helps search engines discover your content structure efficiently. You can list multiple sitemaps (main sitemap, news sitemap, image sitemap, video sitemap).
- Preview & Download: Review the generated robots.txt code in the preview panel. Click "Copy Code" to paste directly into your website, or "Download File" to save as robots.txt and upload to your server's root directory.
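As an illustration, a configuration that applies to all bots, blocks an admin area with one exception, throttles Bingbot, and lists a single sitemap would produce output along these lines (the paths and domain are placeholders):

User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Crawl-delay: 5

Sitemap: https://yoursite.com/sitemap.xml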
Deployment & Testing
Upload Location: The robots.txt file must be placed in your website's root directory and accessible at https://yoursite.com/robots.txt. Case-sensitive: must be lowercase "robots.txt", not "Robots.txt" or "ROBOTS.TXT".
Testing your robots.txt:
- Direct browser test: Visit yoursite.com/robots.txt to confirm it loads
- Google Search Console: Use the robots.txt Tester (Legacy Tools & Reports section) to test specific URLs against your rules
- Syntax validation: Check for typos, encoding issues (use UTF-8), or unintended blocking
- Monitor Search Console: After deployment, watch for crawl errors or unexpected drops in indexed pages
Common deployment mistake: Uploading robots.txt to a subdirectory like yoursite.com/pages/robots.txt won't work. Bots only check the root directory. Each subdomain needs its own robots.txt (blog.yoursite.com/robots.txt is separate from yoursite.com/robots.txt).
Common Robots.txt Errors & How to Fix Them
Accidentally Blocking Entire Site
What it means: Using Disallow: / tells all crawlers not to crawl ANY pages on your site. This is the most catastrophic robots.txt mistake and can make your site all but disappear from search results within weeks.
How to fix: Remove Disallow: / or replace with specific paths. Correct: Disallow: /admin/. If you want to allow everything, use Disallow: (blank) or omit the Disallow line entirely. After fixing, use Google Search Console to request re-indexing of important pages.
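A before/after sketch of that fix (the /admin/ path is illustrative):

# Before – blocks the entire site
User-agent: *
Disallow: /

# After – blocks only the admin area
User-agent: *
Disallow: /admin/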
Impact: Real case (2023): E-commerce site accidentally deployed Disallow: / to production. Lost 94% organic traffic in 18 days before discovering the error. Took 6 weeks to fully recover rankings.
Blocking CSS, JavaScript, and Images
What it means: Rules like Disallow: /css/, Disallow: /js/, or Disallow: /images/ prevent Googlebot from rendering your pages properly. Google needs access to CSS and JS to understand page layout and mobile-friendliness.
How to fix: Never block CSS, JavaScript, or image directories. Google's robots.txt guidelines warn that blocking these resources harms how its algorithms render and index your pages and can result in worse rankings. Remove any Disallow rules for /css/, /js/, /assets/, /static/, or image folders.
Impact: Pages may be indexed but render poorly in mobile search results. Google's mobile-first indexing requires access to styling. Sites blocking CSS/JS saw 15-30% ranking drops in Google's mobile-first migration (2018-2023).
Wrong File Location or Capitalization
What it means: Robots.txt files not in the root directory (yoursite.com/robots.txt) or with incorrect capitalization (Robots.txt, ROBOTS.TXT) will be ignored by crawlers. Bots only check exactly yoursite.com/robots.txt (lowercase, root directory).
How to fix: Ensure file is at https://yoursite.com/robots.txt (not in /public/, /pages/, or subdirectories). File name must be lowercase: robots.txt. Test by visiting yoursite.com/robots.txt in a browser—it should display your rules as plain text. On Linux servers, filenames are case-sensitive.
Impact: If robots.txt is missing or in the wrong location, crawlers assume full access (the equivalent of User-agent: * with an empty Disallow:). While this isn't harmful for most sites, you lose crawl budget control and may run into duplicate content issues.
Using Relative URLs for Sitemap Directive
What it means: Sitemap URLs in robots.txt must be absolute (full URLs), not relative paths. Sitemap: /sitemap.xml is invalid. Correct format: Sitemap: https://example.com/sitemap.xml.
How to fix: Always include full protocol and domain: Sitemap: https://yoursite.com/sitemap.xml. If you have multiple sitemaps, list each on separate lines. For subdomains, specify full URL: Sitemap: https://blog.yoursite.com/sitemap.xml. Test in Google Search Console under Sitemaps section.
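For instance (the domain is a placeholder):

# Incorrect – relative path, may be ignored
Sitemap: /sitemap.xml

# Correct – absolute URL
Sitemap: https://yoursite.com/sitemap.xml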
Impact: Search engines may not discover or process your sitemap, slowing down indexing of new pages. Particularly harmful for large sites or sites with frequently updated content (news sites, e-commerce with daily product additions).
Syntax Errors (Typos, Spacing, Encoding)
What it means: Extra spaces, tabs, smart quotes, or the wrong encoding can break robots.txt parsing. Common problems include misspelled directives (e.g. Disalow: /admin/), stray spaces such as User-agent : * (space before the colon), and curly quotes (“ ”) pasted in from a word processor instead of straight quotes.
How to fix: Use plain text editor (Notepad++, VS Code, Sublime), NOT Microsoft Word or rich text editors that add hidden formatting. Ensure UTF-8 encoding. Format: Directive: value (directive, colon, space, value). Test with Google Search Console robots.txt Tester to identify parsing errors.
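For reference, a minimal, correctly formatted rule group saved as plain UTF-8 text (the paths are illustrative):

User-agent: *
Disallow: /private/
Allow: /private/docs/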
Impact: Malformed robots.txt may be partially or completely ignored. Crawlers might skip invalid lines, leading to unintended crawling behavior. Silent failures—your site appears fine but crawlers aren't following your rules.
Blocking Pages You Want Indexed
What it means: Overly broad Disallow rules unintentionally block important pages. Example: Disallow: /*? (blocking all query parameters) might block essential e-commerce filter pages or blog pagination if URLs use ?page=2 structure.
How to fix: Use specific Allow rules to create exceptions, as in the sketch below: block query-parameter URLs with Disallow: /*? but carve out pagination with Allow: /*?page=. Test individual important URLs with the Google Search Console robots.txt Tester before deploying. For complex sites, start conservative (block only known-bad paths) rather than applying aggressive blanket blocking.
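The exception pattern written out (this assumes pagination URLs like /blog?page=2; the parameter name is illustrative):

# Block query-parameter URLs, but keep paginated URLs crawlable
Disallow: /*?
Allow: /*?page=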
Impact: Important pages disappear from search results. Real case: News site blocked /*?utm_ to hide tracking URLs, accidentally blocking ?article_id= URLs, causing 62% of article pages to deindex. Took 3 months to recover lost rankings.
Real-World Robots.txt Examples
E-Commerce Site (Large Product Catalog)
Scenario: Online store with 25,000 products but 500,000+ filter/sort combinations (price ranges, colors, sizes, brands). Need to prevent Google from wasting crawl budget on infinite filter variations.
Robots.txt implementation:
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?price_min=
Disallow: /checkout/
Disallow: /cart/
Disallow: /my-account/
Allow: /category/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Result: Blocks crawler access to 480,000+ filter URLs, focusing crawl budget on actual product pages. Also blocks checkout flow and customer accounts (privacy + useless for search). Crawl efficiency improved 67%, new products indexed 3x faster.
WordPress Blog with Plugins
Scenario: Blogger using WordPress with 200 posts but thousands of generated pages from plugins (search results, comment feeds, author archives, date archives). Want to prevent duplicate content indexing.
Robots.txt implementation:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /?s=
Disallow: /search/
Disallow: /author/
Disallow: */feed/
Disallow: */trackback/
Disallow: */attachment/

Sitemap: https://blog.com/sitemap.xml
Result: Blocks WordPress admin, plugin files, theme assets, search results, author pages, and feed URLs. Focuses Google on actual blog posts and pages. Reduced crawled URLs by 78%, focusing on content that matters for search rankings.
News Site (High-Frequency Publishing)
Scenario: News website publishing 50+ articles daily. Need fast indexing of new content, but want to block print versions, AMP cache, and archived content older than 2 years.
Robots.txt implementation:
User-agent: *
Disallow: /print/
Disallow: /archive/2020/
Disallow: /archive/2021/
Disallow: /amp/cache/
Allow: /amp/

User-agent: Googlebot-News
Crawl-delay: 0

User-agent: Bingbot
Crawl-delay: 5

Sitemap: https://news.com/sitemap-news.xml
Sitemap: https://news.com/sitemap-articles.xml
Sitemap: https://news.com/sitemap-videos.xml
Result: Blocks printer-friendly pages and old archives (outdated news). Allows AMP pages but blocks the AMP cache. Sets a 5-second Crawl-delay for Bingbot; the Crawl-delay: 0 line for Googlebot-News is effectively a no-op, since Google ignores the directive and manages its own crawl rate. Multiple sitemaps cover articles, videos, and news. New articles indexed within 15 minutes vs. the previous 2-3 hour average.
SaaS Platform (Public + Private Sections)
Scenario: Software-as-a-Service company with public marketing site and customer login area. Want marketing pages indexed, but not app functionality, customer dashboards, or API documentation.
Robots.txt implementation:
User-agent: *
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/
Disallow: /admin/
Disallow: /login/
Disallow: /signup?*
Allow: /signup$
Allow: /blog/
Allow: /pricing/
Allow: /features/

Sitemap: https://saas.com/sitemap.xml
Result: Blocks all app functionality, customer dashboards, API endpoints, and admin areas. Allows signup page (/signup) but blocks signup with query parameters (/signup?ref=) to prevent tracking URL indexing. Explicitly allows public marketing pages. Note: This is complemented by actual login requirements on /app/ and /dashboard/ (robots.txt alone is NOT security).
Frequently Asked Questions
What happens if I don't have a robots.txt file?
If your website doesn't have a robots.txt file, search engines will assume they have permission to crawl all publicly accessible pages. This is equivalent to having a robots.txt file that allows everything:
User-agent: *
Disallow:
While this isn't necessarily bad, having a robots.txt file gives you more control over how search engines interact with your site, and can help prevent them from wasting resources crawling unimportant pages.
Do all bots and crawlers obey robots.txt?
No, robots.txt is not legally binding and relies on voluntary compliance.
Well-behaved search engines (Google, Bing, Yahoo, etc.) will respect your robots.txt directives. However, malicious bots, scrapers, and hackers often ignore robots.txt files completely. Think of robots.txt as a "polite request" rather than a security measure.
Important: Never use robots.txt to hide sensitive information. Malicious actors can read your robots.txt file to find pages you don't want crawled, making it a roadmap to your sensitive content.
Are there any downsides to having a robots.txt file?
Yes, there are a few potential downsides:
- Public visibility: Your robots.txt file is publicly accessible at yoursite.com/robots.txt. Anyone can see which paths you're blocking, potentially revealing the structure of your site or locations of admin areas.
- Not a security tool: Blocking a path in robots.txt doesn't prevent people from accessing it directly. It only asks search engines not to crawl it.
- Accidental blocking: A misconfigured robots.txt can accidentally block important pages from search engines, hurting your SEO.
- Pages can still be indexed: If other sites link to a blocked page, search engines might still index it (though they won't crawl its content).
Where should I place my robots.txt file?
Your robots.txt file must be placed in the root directory of your website and must be named exactly "robots.txt" (lowercase).
For example:
- ✅ Correct: https://example.com/robots.txt
- ❌ Incorrect: https://example.com/pages/robots.txt
- ❌ Incorrect: https://example.com/Robots.txt
- ❌ Incorrect: https://example.com/robots.TXT
Each subdomain needs its own robots.txt file if you want different rules (e.g., blog.example.com/robots.txt is separate from example.com/robots.txt).
What is the difference between Allow and Disallow?
Disallow: Tells crawlers NOT to access a specific path.
Disallow: /admin/
Allow: Explicitly permits crawlers to access a path (useful for creating exceptions to broader Disallow rules).
Disallow: /admin/
Allow: /admin/public/
In the example above, all of /admin/ is blocked except /admin/public/ which is explicitly allowed.
Can I use robots.txt to hide sensitive or private areas of my site?
Absolutely not! This is a critical security mistake.
Using robots.txt to block sensitive areas has several problems:
- The robots.txt file is publicly readable, so you're advertising where your sensitive content is located
- Malicious bots don't respect robots.txt and will specifically target blocked areas
- Direct links to blocked pages still work - robots.txt doesn't prevent access
Correct approach: Use proper authentication, passwords, server configuration, or firewalls to protect sensitive content.
What is a user-agent?
A user-agent is the name/identifier of the bot or crawler. Different search engines use different user-agent names:
- Googlebot - Google's web crawler
- Bingbot - Microsoft Bing's crawler
- Slurp - Yahoo's crawler
- DuckDuckBot - DuckDuckGo's crawler
- * (asterisk) - Matches all bots
You can specify different rules for different crawlers. For example, you might want to slow down a particularly aggressive crawler with a crawl-delay while allowing others full access.
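A minimal sketch of that setup (the bot name and delay value are illustrative):

# All other bots: full access
User-agent: *
Disallow:

# One aggressive crawler: slow it down
User-agent: ExampleAggressiveBot
Crawl-delay: 10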
What does the Crawl-delay directive do?
Crawl-delay specifies the number of seconds a crawler should wait between requests to your server. For example:
Crawl-delay: 10
This tells the bot to wait 10 seconds between each page request.
When to use it:
- Your server is struggling with bot traffic
- You want to reduce server load from crawlers
- A specific bot is being too aggressive
Note: Google ignores Crawl-delay. Instead, use Google Search Console to adjust Googlebot's crawl rate. Bing and other search engines do respect this directive.
Can I use wildcards in my robots.txt file?
Yes! Most modern crawlers support wildcard patterns:
- * (asterisk) - Matches any sequence of characters
- $ (dollar sign) - Matches the end of a URL
Examples:
# Block all URLs with "?" (query parameters)
Disallow: /*?

# Block all .pdf files
Disallow: /*.pdf$

# Block all URLs containing "admin"
Disallow: /*admin
Note: Wildcards are supported by Google, Bing, and most modern crawlers, but very old or simple bots may not understand them.
How do I test my robots.txt file?
You can test your robots.txt file in several ways:
- Direct access: Visit yoursite.com/robots.txt in a browser to see if it loads correctly
- Google Search Console: Use the robots.txt Tester tool (under Legacy tools & reports)
- Online validators: Use free robots.txt testing tools to check syntax
- Check formatting: Make sure there are no extra spaces, special characters, or encoding issues
Common testing mistakes:
- Using a text editor that adds hidden characters or smart quotes
- Saving with the wrong encoding (use UTF-8)
- Having a typo in the filename (must be exactly "robots.txt")
Should I add my sitemap to robots.txt?
Yes, it's highly recommended! Adding your sitemap URL to robots.txt helps search engines discover and crawl your site more efficiently.
Sitemap: https://example.com/sitemap.xml
Benefits:
- Helps search engines find all your important pages
- Provides a central reference for your site structure
- Can specify multiple sitemaps if needed
- No downside - it only helps search engines
You can list multiple sitemaps:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-images.xml
What are the most common robots.txt mistakes?
Here are the most common and costly robots.txt mistakes:
- Blocking your entire site by accident:
  # DON'T DO THIS (unless you really want to block everything)
  User-agent: *
  Disallow: /
- Blocking CSS and JavaScript - Google needs these to render your pages properly. Don't block /css/ or /js/ folders.
- Using it as a security measure - As mentioned, robots.txt is not for security.
- Incorrect file location - Must be in root directory, not in a subdirectory.
- Case sensitivity errors - The filename must be lowercase "robots.txt", but paths can be case-sensitive depending on your server.
- Forgetting to update after site changes - Review your robots.txt when you restructure your site.
- Using relative URLs - Sitemap URLs must be absolute (include full domain).