02-17-2025, 08:36 PM
This guide will explain how to use the Wayback Machine (web.archive.org) to find sensitive files, old endpoints, and other potentially valuable information for security testing.
1. Extracting Archived URLs with Web Archive
The Wayback Machine stores historical snapshots of websites, including pages and files that might no longer be publicly accessible. By querying its index, you can retrieve archived URLs for a specific domain.
Querying Web Archive via API
Use the following cURL command to extract archived URLs for a target domain:
curl -G "https://web.archive.org/cdx/search/cdx" \
--data-urlencode "url=*.example.com/*" \
--data-urlencode "collapse=urlkey" \
--data-urlencode "output=text" \
--data-urlencode "fl=original" > out.txt
- url=*.example.com/* matches all subdomains and paths of the target domain.
- collapse=urlkey collapses repeated captures of the same URL, so each URL is returned only once.
- output=text returns plain text, one result per line.
- fl=original limits each result to the original URL field.
- The result is saved to out.txt.
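The query above can be wrapped in a small function so it is reusable across targets. This is only a sketch: the names cdx_pattern and fetch_archived_urls are made up for illustration, and the only CDX parameters used are the ones already listed above.

```shell
# Hypothetical helper: turn a domain (or a pasted URL) into the CDX url pattern.
cdx_pattern() {
    local host
    host="${1#*://}"    # drop a leading scheme such as https://
    host="${host%%/*}"  # drop any path
    host="${host%.}"    # drop a trailing dot
    printf '*.%s/*\n' "$host"
}

# Hypothetical helper: run the CDX query and save the URLs (default file: out.txt).
fetch_archived_urls() {
    local domain="$1" outfile="${2:-out.txt}"
    curl -sS -G "https://web.archive.org/cdx/search/cdx" \
        --data-urlencode "url=$(cdx_pattern "$domain")" \
        --data-urlencode "collapse=urlkey" \
        --data-urlencode "output=text" \
        --data-urlencode "fl=original" > "$outfile"
}
```

Usage would look like: fetch_archived_urls example.com urls.txt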
Example:
example$ curl -G "https://web.archive.org/cdx/search/cdx" \
--data-urlencode "url=*.deepseek.com/*" \
--data-urlencode "collapse=urlkey" \
--data-urlencode "output=text" \
--data-urlencode "fl=original" > out.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   99k    0   99k    0     0  37056      0 --:--:--  0:00:02 --:--:-- 37049
example$ cat out.txt
http://www.deepseek.com:80/
https://www.deepseek.com/%0A
https://www.deepseek.com/%0A%0A
https://www.deepseek.com/"
https://www.deepseek.com/?gad_source=1
https://www.deepseek.com/?ref=ducttapemarketing
https://www.deepseek.com/?ref=localhost
https://www.deepseek.com/?ref=prompt.cn
https://www.deepseek.com/?ref=testingcatalog.com
https://www.deepseek.com/?utm_source=www.theautomated.co&utm_medium=referral&utm_campaign=china-s-new-ai-challenger-deepseek-r1
https://www.deepseek.com/?utm_source=www.aidrop.news&utm_medium=referral&utm_campaign=figure-bmw-a-frota-de-robos-autonomos
https://www.deepseek.com/?utm_medium=referral&utm_campaign=topic-10-inside-deepseek-models&utm_source=www.turingpost.com
https://www.deepseek.com/?utm_source=tap4-ai&utm_medium=referral
https://www.deepseek.com/?utm_source=chatgpt.com
https://www.deepseek.com/?utm_source=www.turingpost.com
https://www.deepseek.com/%5Cn
https://www.deepseek.com/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo.png&w=828&q=75%201x,%20/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo.png&
https://www.deepseek.com/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo.png&w=828&q=75%201x,%20/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo.png&w=1920&q=75%202x
https://www.deepseek.com/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo.png&w=1920&q=75
https://www.deepseek.com/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo.png&w=828&q=75
https://www.deepseek.com/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo_2.png%3Fv%3D1&w=1080&q=75
https://www.deepseek.com/_next/image?url=https%3A%2F%2Fcdn.deepseek.com%2Flogo_2.png%3Fv%3D1&w=640&q=75
https://www.deepseek.com/_next/image%3Furl%3Dhttps://cdn.deepseek.com/logo.png%26w%3D1920%26q%3D75
<SNIP>
Once the URLs are retrieved, you can process them further to find potentially interesting files.
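As the sample output shows, many archived URLs differ only by tracking parameters or an explicit default port. One possible first processing step is to collapse those before filtering. The function name unique_paths is an assumption for illustration, and dropping query strings is a deliberate trade-off: it removes noise but also hides parameters you may want to inspect later.

```shell
# Hypothetical helper: read URLs on stdin, print de-duplicated URLs without
# query strings, fragments, or explicit default ports.
unique_paths() {
    sed -E 's/[?#].*$//
            s#^(http://[^/:]+):80(/|$)#\1\2#
            s#^(https://[^/:]+):443(/|$)#\1\2#' \
      | LC_ALL=C sort -u
}
```

For example: unique_paths < out.txt > paths.txt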
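2. Searching for Sensitive Files
After extracting the URLs, use the following command to keep only those that point to files that might contain sensitive information:
grep -Ei '\.(xls|xlsx|xml|json|pdf|sql|doc|docx|pptx|txt|zip|tar\.gz|tgz|bak|7z|rar|log|cache|secret|db|backup|yml|yaml|md|md5|exe|dll|bin|ini|bat|sh|tar|deb|rpm|iso|img|apk|msi|dmg|tmp|crt|pem|key|pub|asc)([?#]|$)' out.txt
This matches common sensitive file types such as databases, configuration files, scripts, certificates, documents, backups, and logs. The escaped dot and the trailing ([?#]|$) restrict matches to real file extensions at the end of the path, so a URL is not matched just because it contains "sh" or "db" somewhere in the middle.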
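On a large target the filtered list can still be long, so it may help to see which extensions dominate before opening anything. A rough sketch, with the function name count_extensions chosen only for this example:

```shell
# Hypothetical helper: read URLs on stdin, print "count .ext" lines,
# most frequent extension first. Only the last extension of the path is counted.
count_extensions() {
    sed -E 's/[?#].*$//
            s#^[a-zA-Z]+://[^/]*##' \
      | grep -oiE '\.[a-z0-9]+$' \
      | tr 'A-Z' 'a-z' \
      | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn \
      | awk '{print $1, $2}'
}
```

For example: count_extensions < out.txt | head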
3. Gathering Additional Intelligence
To enhance reconnaissance, you can use additional sources such as VirusTotal and AlienVault OTX to gather information about a domain.
VirusTotal API Lookup
VirusTotal allows querying a domain for URLs it has seen on that domain, known subdomains, DNS resolutions, and related malware samples:
curl -s "https://www.virustotal.com/vtapi/v2/domain/report?apikey=YOUR_API_KEY&domain=example.com"
Replace example.com with your target domain and YOUR_API_KEY with your own key. This requires an API key from VirusTotal (free tier available).
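If you run this lookup often, one option is to read the key from an environment variable so it does not end up in your shell history. This is a sketch only: the variable name VT_API_KEY and the function name vt_domain_report are assumptions for this example, not something VirusTotal defines.

```shell
# Hypothetical helper: query the VirusTotal v2 domain report for a domain.
# Expects the API key in the VT_API_KEY environment variable (name assumed here).
vt_domain_report() {
    local domain="$1"
    if [ -z "${VT_API_KEY:-}" ]; then
        echo "VT_API_KEY is not set" >&2
        return 1
    fi
    curl -s -G "https://www.virustotal.com/vtapi/v2/domain/report" \
        --data-urlencode "apikey=${VT_API_KEY}" \
        --data-urlencode "domain=${domain}"
}
```

Usage: export VT_API_KEY once, then call vt_domain_report example.com > vt.json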
AlienVault OTX Intelligence
AlienVault OTX provides intelligence on domains, including linked URLs, indicators of compromise (IoCs), and threat reports:
curl -s "https://otx.alienvault.com/api/v1/indicators/hostname/example.com/url_list?limit=500&page=1"
Replace example.com with your target domain. This retrieves up to 500 URLs associated with the domain per page; increase the page parameter to fetch further results.
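Since results are paged, a loop is needed to collect more than the first 500 URLs. The sketch below assumes the JSON response carries each URL in a "url" field and signals further pages with "has_next": true; check that against a real response before relying on it. All function names are made up for this example, and the URL extraction is a rough text match rather than a real JSON parser (a tool such as jq would be more robust).

```shell
# Hypothetical helper: fetch one page of OTX results for a hostname.
otx_fetch_page() {
    curl -s "https://otx.alienvault.com/api/v1/indicators/hostname/$1/url_list?limit=500&page=$2"
}

# Hypothetical helper: pull "url" values out of a JSON response on stdin.
# Rough text match; assumes URLs contain no escaped quotes.
otx_extract_urls() {
    grep -oE '"url": *"[^"]*"' | sed -E 's/^"url": *"//; s/"$//'
}

# Hypothetical helper: walk pages until has_next is no longer true
# (assumed field name) or max_pages is reached.
otx_all_urls() {
    local host="$1" max_pages="${2:-20}" page=1 body
    while [ "$page" -le "$max_pages" ]; do
        body="$(otx_fetch_page "$host" "$page")" || break
        printf '%s\n' "$body" | otx_extract_urls
        printf '%s\n' "$body" | grep -qE '"has_next": *true' || break
        page=$((page + 1))
    done
}
```

Usage: otx_all_urls example.com >> out.txt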
4. Applying These Techniques
Use Cases in Bug Hunting and Security Research
- Finding forgotten endpoints that companies have removed but that still leave traces in the Web Archive.
- Extracting old exposed files such as backups, config files, logs, and documentation.
- Checking for sensitive information leaks, including publicly visible API keys, credentials, or database connection strings in old site versions.
- Enumerating subdomains and endpoints to expand the attack surface.
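For the subdomain enumeration case, the hostnames can be pulled straight out of the archived URL list. A minimal sketch, with list_hosts as an illustrative name:

```shell
# Hypothetical helper: read URLs on stdin, print each hostname once, lowercased.
list_hosts() {
    sed -E 's,^[a-zA-Z][a-zA-Z0-9+.-]*://,,
            s,[/?#].*$,,
            s,:[0-9]+$,,' \
      | tr 'A-Z' 'a-z' \
      | LC_ALL=C sort -u
}
```

For example: list_hosts < out.txt > hosts.txt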