
Git-to-S3 Infrastructure OSINT: Repository Credential Pivoting to Cloud Storage Data Extraction

forensic_file_artifacts Difficulty 1–5 30 min certifiable

Theory

Why This Matters

Public code repositories are among the highest-yield intelligence sources in cloud infrastructure reconnaissance. In the 2016 Uber breach, attackers gained access to a private GitHub repository used by Uber engineers, found AWS credentials in the code, and used them to read rider and driver data from an S3 bucket. The 2022 Uber breach shows the same failure in a different place: after logging in with purchased contractor credentials, the attacker found hardcoded administrator credentials in a PowerShell script on an internal network share. The Twitch 2021 leak exposed source code drawn from internal Git repositories in a data dump of roughly 125GB. Beyond breaches, bug bounty hunters can earn payouts in the $10,000–$50,000 range by finding AWS access keys in public GitHub repositories and using them to demonstrate S3 bucket access. Systematically mining git repositories for cloud storage references is a fundamental cloud security assessment skill for both red team operators and defenders building detection capabilities.

Core Concept

Git repository intelligence exploits two fundamental properties of version control systems: history persistence and content searchability. Unlike a live file, a git repository preserves every version of every file since the repository was created. A secret committed three years ago and subsequently deleted still exists in git history and is recoverable by anyone with access to the repository. GitHub code search provides direct content search across all public repositories on GitHub, making it possible to search for specific string patterns — credential prefixes, bucket names, environment variable names — across millions of repositories simultaneously.

AWS credential patterns in code include: environment variable assignments (AWS_ACCESS_KEY_ID=, AWS_SECRET_ACCESS_KEY=), configuration file entries (aws_access_key_id =), and S3 bucket references (s3://bucket-name, BUCKET_NAME=, s3.amazonaws.com/bucket). GitHub code search queries such as org:company-name AWS_ACCESS_KEY_ID or org:company-name s3.amazonaws.com search all repositories in an organization's GitHub org for these patterns.
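
The patterns above can be sketched as a few regular expressions. This is a minimal illustration, not the ruleset of any real scanner: the function name `find_indicators`, the `acme-*` bucket names, and the sample text are hypothetical, and the access key shown is the placeholder value used in AWS documentation.

```python
import re

# Illustrative patterns only; production scanners ship far larger rulesets.
# Long-term access key IDs start with AKIA, temporary (STS) ones with ASIA.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
# Bucket name taken from an s3:// URI.
S3_URI = re.compile(r"s3://([a-z0-9][a-z0-9.-]{1,61}[a-z0-9])")
# Bucket name taken from a virtual-hosted-style hostname.
S3_HOST = re.compile(r"\b([a-z0-9][a-z0-9.-]{1,61}[a-z0-9])\.s3\.amazonaws\.com")


def find_indicators(text):
    """Return access key IDs and bucket names referenced in a blob of text."""
    buckets = set(S3_URI.findall(text)) | set(S3_HOST.findall(text))
    return {
        "access_key_ids": sorted(set(ACCESS_KEY_ID.findall(text))),
        "buckets": sorted(buckets),
    }


# Hypothetical file content of the kind a code search might return.
SAMPLE = """
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
BACKUP_TARGET=s3://acme-backups/nightly/
ASSET_URL=https://acme-assets.s3.amazonaws.com/logo.png
"""

if __name__ == "__main__":
    print(find_indicators(SAMPLE))
```

Keeping the credential pattern and the bucket patterns separate makes it easy to report a bucket reference even when no key is present, which is the more common case in practice.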

Exposed .env files in repositories are a particularly common finding. Developers frequently commit .env files containing all environment variables — including cloud credentials — and rely on .gitignore to prevent future commits. However, if the .env file was committed before the .gitignore entry was added, it persists in git history even after deletion from the working tree.

GitLeaks is a static analysis tool for detecting secrets in git repositories. It scans every commit across all branches using a comprehensive ruleset of credential patterns. The GitLeaks GitHub Action (gitleaks/gitleaks-action, formerly published as zricethezav/gitleaks-action) runs GitLeaks on every push and pull request and fails the workflow when a secret is detected; combined with required status checks, a failed run blocks the merge. This makes GitLeaks both an attack tool (for analyzing discovered repositories) and a defense tool (for CI/CD pipeline integration).

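truffleHog complements GitLeaks with entropy analysis: in addition to pattern matching, truffleHog can identify high-entropy string sequences that are likely to be secrets even if they do not match known patterns. This catches custom API keys and randomly generated tokens that no regex pattern covers. Entropy scanning was a default behavior of the older v2 releases; the v3 CLI used in the commands on this card relies mainly on service-specific detectors with optional live verification and applies entropy as a filter on unverified results.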
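
To make the entropy idea concrete, here is a small sketch of Shannon entropy scoring over whitespace-delimited tokens. It is not truffleHog's implementation; the function names, the 20-character minimum, and the 4.0 bits-per-character threshold are assumptions chosen for illustration. The sample secret is the placeholder value from AWS documentation.

```python
import math
import re
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def high_entropy_tokens(text, min_len=20, threshold=4.0):
    """Return tokens that are long and random-looking.

    min_len and threshold are illustrative assumptions, not values taken
    from any particular tool.
    """
    tokens = re.split(r"[\s=\"':,]+", text)
    return [t for t in tokens
            if len(t) >= min_len and shannon_entropy(t) > threshold]


if __name__ == "__main__":
    sample = (
        'secret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"\n'
        'padding = "aaaaaaaaaaaaaaaaaaaaaaaa"\n'
    )
    print(high_entropy_tokens(sample))
```

A repeated-character string scores zero while a randomly generated key scores well above four bits per character, which is why entropy catches secrets that have no recognizable prefix. The same property produces false positives on hashes and minified assets, so entropy hits need manual review.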

Technical Deep-Dive

# GitHub code search for AWS credentials in a specific organization
# (Requires GitHub authentication; a free account is sufficient)
# Via GitHub CLI:
gh search code "AWS_ACCESS_KEY_ID" --owner=target-org --limit=50 \
  --json path,repository,textMatches | python3 -m json.tool

# Search for S3 bucket references in organization repositories
gh search code "s3.amazonaws.com" --owner=target-org --limit=50 \
  --json path,repository,textMatches |
  python3 -c "
import json, sys, re
results = json.load(sys.stdin)
buckets = set()
for r in results:
    for tm in r.get('textMatches', []):
        fragment = tm.get('fragment', '')
        # Extract bucket names from s3:// URIs and .s3.amazonaws.com hostnames
        # (dots are escaped so they match a literal '.', not any character)
        buckets.update(re.findall(r's3://([a-z0-9][a-z0-9.-]{2,62})', fragment))
        buckets.update(re.findall(r'([a-z0-9][a-z0-9.-]{2,62})\.s3\.amazonaws\.com', fragment))
for b in sorted(buckets):
    print(b)
"

# GitLeaks: scan a cloned repository for secrets (all history)
# GitLeaks is a Go binary, not a pip package. Install it from the project's
# GitHub releases page or a package manager, for example:
brew install gitleaks
gitleaks detect --source=./cloned-repo/ --report-format=json \
  --report-path=gitleaks-report.json

# Parse findings
python3 - <<'EOF'
import json
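with open("gitleaks-report.json") as report:
    findings = json.load(report)
# One line per finding: rule, file:line, and the abbreviated commit hash
for finding in findings:
    print(f"[{finding.get('RuleID','?'):30s}] {finding.get('File','?')}:{finding.get('StartLine','?')} commit:{finding.get('Commit','?')[:8]}")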
EOF

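# truffleHog (v3 CLI): full history scan with live verification disabled
# Output is one JSON object per line, so json.tool needs --json-lines (Python 3.8+)
trufflehog git file://./cloned-repo/ --json --no-verification |
  python3 -m json.tool --json-lines | grep -E '"DetectorName"|"Raw"|"SourceMetadata"' | head -60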

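# truffleHog: scan an entire GitHub organization
# The Python snippet is wrapped in single quotes so its double-quoted strings survive the shell
trufflehog github --org=target-org \
  --token="$GITHUB_TOKEN" \
  --json --no-verification 2>/dev/null |
  python3 -c '
import sys, json
for line in sys.stdin:
    try:
        r = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip lines that are not JSON
    if r.get("Raw"):
        data = r.get("SourceMetadata", {}).get("Data", {})
        # The metadata key depends on the source type (Github for org scans, Git for local scans)
        src = data.get("Github") or data.get("Git") or {}
        detector = r.get("DetectorName", "?")
        repo = src.get("repository", "?")
        print(f"{detector:20s} | {repo}")
'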

# Enumerate S3 buckets discovered from code repositories
# Without credentials (unauthenticated):
aws s3 ls s3://discovered-bucket-name --no-sign-request
# With credentials:
aws s3 ls s3://discovered-bucket-name

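# s3scanner: permission assessment across a list of discovered bucket names
# Flag names differ between releases; confirm with: s3scanner --help
pip install s3scanner
# From a file of bucket names (one per line), Python 2.x CLI syntax:
s3scanner --threads 10 scan --buckets-file discovered_buckets.txt
# The Go-based 3.x CLI uses single-dash flags instead:
#   s3scanner -bucket-file discovered_buckets.txt -threads 10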

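# Check for .git directory or git bundle in a bucket
# (dots are escaped so the pattern does not match unrelated names such as "digit")
aws s3 ls s3://discovered-bucket-name --recursive --no-sign-request | grep -E '\.git/|\.bundle$|\.pack$|COMMIT_EDITMSG'
# Download a discovered .git directory
aws s3 sync s3://discovered-bucket-name/.git/ ./recovered.git/ --no-sign-request
# Restore the working tree into a separate directory
mkdir -p recovered-tree
git --git-dir=./recovered.git --work-tree=./recovered-tree checkout HEAD -- .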
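
Regex extraction from code tends to produce candidates that cannot be real bucket names, and filtering them before scanning saves requests. The sketch below checks a subset of the documented S3 naming rules (length 3 to 63, lowercase letters, digits, dots and hyphens, alphanumeric first and last character, no adjacent dots, not an IPv4 address). The function name `is_plausible_bucket_name` is hypothetical, and reserved prefixes and suffixes are deliberately not covered.

```python
import ipaddress
import re

# 3-63 characters, lowercase letters/digits/dots/hyphens,
# starting and ending with a letter or digit.
_NAME_SHAPE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name):
    """Check a candidate against a subset of the S3 bucket naming rules."""
    if not _NAME_SHAPE.fullmatch(name):
        return False
    if ".." in name:
        return False  # adjacent periods are not allowed
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True   # not an IP address, so the name is acceptable
    return False      # names formatted as IPv4 addresses are not allowed


if __name__ == "__main__":
    candidates = ["acme-backups", "192.168.5.4", "Bad_Name", "ab", "a..b"]
    print([c for c in candidates if is_plausible_bucket_name(c)])
```

Writing the surviving names one per line gives a file suitable as the bucket list input for the scanner step.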

Intelligence Collection Methodology

  1. Enumerate target GitHub/GitLab organizations: Use the recon-ng module recon/profiles-repositories/github_repos or the GitHub API to list all repositories in the target organization. Note repository counts, last push dates, and visibility (public/private). Private repos are out of scope without authorization; focus on public repos.
  2. Run GitHub code search for credential indicators: Use gh search code or the GitHub web interface to search the organization for patterns: AWS_ACCESS_KEY_ID, SECRET_ACCESS_KEY, s3.amazonaws.com, BUCKET_NAME, .env. Log every matching file path, repository, and commit.
  3. Clone all relevant public repositories: For each repository with positive search results, git clone the full repository including all history (git clone --mirror to get all refs). This preserves branches and tags that may not be on the default branch.
  4. Run GitLeaks and truffleHog over all cloned repositories: Execute gitleaks detect --source=./repo/ for fast pattern-based scanning. Follow with trufflehog git file://./repo/ --json for entropy-based scanning. Aggregate findings from both tools — they catch different secret types.
  5. Extract all S3 bucket name references from code and findings: Use regex to extract bucket names from all found credentials, configuration files, and code. Include patterns: s3://BUCKET, BUCKET.s3.amazonaws.com, BUCKET_NAME="...", environment variable assignments.
  6. Test each discovered bucket name with s3scanner: Run s3scanner against the bucket list (s3scanner scan --buckets-file buckets.txt on the 2.x CLI; flag names differ in 3.x) to determine which buckets exist and which are publicly accessible. For buckets that are publicly listable, enumerate their contents with aws s3 ls s3://BUCKET --recursive --no-sign-request.
  7. Search for git repositories stored in S3 buckets: For publicly accessible buckets, look for .git/ directories, .bundle files, and .pack files. Download and restore any discovered git repositories, then run GitLeaks and truffleHog on the recovered history.
  8. Check .gitignore patterns for hidden secrets: Review each repository's .gitignore file. Entries like .env, *.pem, credentials.json, *.key indicate that secret file types exist in the project. Their git history may contain the files before they were added to .gitignore.
  9. Validate discovered credentials: For any found AWS credentials, validate with aws sts get-caller-identity before any further use. Document the finding with: source repository URL, specific file path, commit hash, finding type, and credential validity status.
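
Step 8 can be partly automated. The sketch below tests each .gitignore pattern against a short list of probe file names and reports the patterns that would hide a secret-bearing file. It uses `fnmatch` as a rough stand-in for git's matching rules (directory semantics and `**` are not modeled), and the probe list and the function name `secret_hints` are assumptions made for illustration.

```python
import fnmatch

# Hypothetical file names that typically hold secrets; extend as needed.
SECRET_PROBES = [".env", "prod.pem", "credentials.json", "server.key",
                 "terraform.tfvars"]


def secret_hints(gitignore_text):
    """Map each .gitignore pattern to the probe file names it would ignore."""
    hints = {}
    for raw in gitignore_text.splitlines():
        pattern = raw.strip()
        # Skip blanks, comments, and negated patterns.
        if not pattern or pattern.startswith(("#", "!")):
            continue
        base = pattern.strip("/")
        matched = [p for p in SECRET_PROBES if fnmatch.fnmatchcase(p, base)]
        if matched:
            hints[pattern] = matched
    return hints


if __name__ == "__main__":
    example = "node_modules/\n# secrets\n.env\n*.pem\n!keep.pem\ndist/\n"
    for pattern, probes in secret_hints(example).items():
        print(f"{pattern!r} suggests secret files such as {probes}")
```

Each reported pattern is a lead rather than a finding: the follow-up is to search history for paths matching it, for example with `git log --all --full-history -- .env`.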

Common Intelligence Collection Errors

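  • Searching only the default branch (main/master): Developers commonly commit secrets to feature branches, hotfix branches, or stale development branches that are never merged and never cleaned. Always scan all branches: git branch -a lists both local and remote-tracking branches; GitLeaks and truffleHog scan all refs by default.
  • Missing secrets in git stash: git stash saves uncommitted work, including secrets in modified tracked files or untracked files added with git stash -u. Stashes are stored under refs/stash in the local repository and are not transferred by git clone or git fetch, so they are available only when the .git directory itself has been obtained (for example, one recovered from an S3 bucket). In that case, run git stash list and git stash show -p stash@{0} to inspect stashed content. GitLeaks may not cover every stash entry by default, so inspect them manually.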
  • Treating GitLeaks output as definitive: GitLeaks pattern matching produces both false positives (test credential strings, example values) and false negatives (novel secret formats not in the ruleset). Always validate significant findings manually before reporting. Rotate credentials that are confirmed valid; do not assume a finding is a test value without verifying.
  • Ignoring git submodules: Repositories with submodules reference other repositories at specific commits. The submodule may itself contain secrets in its history. Always run git submodule update --init --recursive after cloning and scan submodule directories separately.
  • Missing secrets in binary files: GitLeaks and truffleHog primarily analyze text. Binary files in git history (SQLite databases, ZIP archives, compiled configurations) may contain embedded credentials. Use strings on binary objects and binwalk for embedded archive extraction as supplementary analysis.
  • Failing to re-check after public repository discovery: Organizations that discover a public repository containing secrets frequently make the repository private without rotating the credentials. If a repository was indexed by a search engine or security scanner while public, the credentials are already compromised. Rotation must happen regardless of repository visibility change.
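
For the binary-file case, the effect of `strings` followed by a pattern search can be approximated in a few lines. This is a sketch under stated assumptions: the function name `key_ids_in_blob` and the 8-character minimum run length are illustrative, the blob in the usage example is synthetic, and only AWS access key IDs are searched for.

```python
import re

# Runs of at least 8 printable ASCII bytes, similar in spirit to `strings`.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{8,}")
# AWS access key ID shape: AKIA (long-term) or ASIA (temporary) + 16 chars.
ACCESS_KEY_ID = re.compile(rb"(?:AKIA|ASIA)[0-9A-Z]{16}")


def key_ids_in_blob(blob):
    """Return access key IDs found in the printable runs of a binary blob."""
    hits = set()
    for run in PRINTABLE_RUN.findall(blob):
        hits.update(m.decode("ascii") for m in ACCESS_KEY_ID.findall(run))
    return sorted(hits)


if __name__ == "__main__":
    # Synthetic stand-in for a binary object extracted from git history.
    blob = (b"\x00\x01SQLite format 3\x00" + b"\xff" * 4
            + b"aws_key=AKIAIOSFODNN7EXAMPLE\x00\x00")
    print(key_ids_in_blob(blob))
```

Compressed containers such as ZIP archives still need to be unpacked first, since the key will not appear as a contiguous printable run inside compressed data.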

NICE Framework Alignment

Code | Knowledge/Skill/Task Statement | How This Card Develops It
K0058 | Knowledge of network protocols | Understanding GitHub API authentication, S3 REST API structure, and HTTP-based git clone protocol as collection mechanisms
K0145 | Knowledge of security assessment approaches | Applying systematic git history mining and S3 bucket validation as structured cloud security assessment steps
K0272 | Knowledge of network security architecture | Mapping how public git repositories, S3 bucket naming conventions, and CI/CD pipeline secrets create an interconnected cloud attack surface
K0427 | Knowledge of encryption algorithms | Understanding how truffleHog's Shannon entropy analysis identifies high-entropy credential strings that pattern matching misses
S0040 | Skill in identifying and extracting data of interest from various sources | Extracting S3 bucket names, AWS credentials, and secret file references from git history using GitLeaks, truffleHog, and regex
T0569 | Apply and utilize authorized cyber capabilities to achieve objectives | Deploying GitLeaks, truffleHog, s3scanner, and GitHub CLI in authorized repository intelligence and cloud storage assessment

