Employee OSINT Profiling: Corporate Web Presence Analysis and Identity Correlation

forensic_file_artifacts Difficulty 1–5 30 min certifiable

Theory

Why This Matters

Corporate websites are intentionally designed to convey trust and legitimacy — and in doing so, they inadvertently publish extensive intelligence about the people, technologies, and internal structure of the organisation. Social engineering campaigns, spear-phishing operations, and business email compromise fraud all depend on accurate staff intelligence: knowing names, roles, photographs, and communication patterns dramatically increases the success rate of impersonation and pretexting. Investigative journalists have used employee OSINT to identify undisclosed executives at shell companies; fraud investigators have correlated staff photographs with LinkedIn profiles to expose fake company websites; red-team operators have used job posting language analysis to enumerate the security tooling an organisation has deployed. This card teaches the systematic collection of employee intelligence from public corporate web presence.

Core Concept

Employee OSINT from corporate websites exploits several categories of intentional and incidental disclosure: team pages listing names and photographs, job postings describing internal technology, document metadata embedding author identities, and code repositories exposing contributor accounts.

Reverse image search is the pivot from a staff photograph to a comprehensive identity profile. Corporate headshots uploaded to /team or /about pages are frequently reused across LinkedIn, conference speaking profiles, GitHub accounts, and personal websites. Yandex reverse image search is generally considered more effective than Google's for facial recognition of professional photographs; both should be queried for comprehensive coverage.

PDF document metadata is a systematic disclosure channel that most organisations overlook. By default, Microsoft Office applications embed the author name configured in Office (often the user's full name or account username) in saved documents, and it survives export to PDF unless the metadata is stripped first. Adobe Acrobat embeds the application version and sometimes the author's name. The exiftool utility extracts all embedded metadata fields, revealing author names, software versions (which identify the organisation's desktop application stack), and creation/modification timestamps (which can indicate working hours and, when a UTC offset is recorded, timezone).
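As a rough illustration of how extracted metadata might be summarised, the sketch below tallies Author and Creator/Producer values across records shaped like `exiftool -json` output. The function name `summarise_metadata` and the sample records (apart from the author name used elsewhere in this card) are hypothetical.

```python
import json
from collections import Counter


def summarise_metadata(records):
    """Tally Author and Creator/Producer values across exiftool-style records."""
    authors = Counter()
    software = Counter()
    for rec in records:
        author = str(rec.get("Author", "")).strip()
        if author:
            authors[author] += 1
        for field in ("Creator", "Producer"):
            value = str(rec.get(field, "")).strip()
            if value:
                software[value] += 1
    return authors, software


# Hypothetical records, shaped like the array that `exiftool -json` emits.
sample = json.loads("""
[
  {"SourceFile": "annual_report_2024.pdf", "Author": "Sarah Mitchell",
   "Creator": "Microsoft Word for Microsoft 365", "Producer": "Microsoft: Print To PDF"},
  {"SourceFile": "brochure.pdf", "Author": "Sarah Mitchell",
   "Creator": "Adobe InDesign", "Producer": "Adobe PDF Library"},
  {"SourceFile": "scan.pdf", "Producer": "Adobe PDF Library"}
]
""")

authors, software = summarise_metadata(sample)
print(authors.most_common())
print(software.most_common())
```

In practice the records would likely come from something like `exiftool -json -r -ext pdf /tmp/corppdfs/ > meta.json`, loaded with `json.load`.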

Job posting language analysis is a disciplined intelligence extraction technique. A posting for "Senior Splunk SIEM Engineer with experience tuning correlation searches" strongly suggests that Splunk is in production. A posting for "Palo Alto Prisma Cloud Administrator" points to the cloud security platform. A posting for "Endpoint Detection and Response Analyst (CrowdStrike)" points to the EDR vendor. Collectively, job postings let an analyst reconstruct much of the organisation's security tooling inventory without any active probing.
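A minimal sketch of the keyword count, assuming the posting text has already been saved locally. The keyword list and the function name `count_tools` are illustrative, not a complete vendor inventory.

```python
import re
from collections import Counter

# Illustrative keyword list; extend it for the sector being assessed.
TOOL_KEYWORDS = ["splunk", "crowdstrike", "palo alto", "prisma cloud",
                 "sentinelone", "tenable", "qualys"]


def count_tools(text, keywords=TOOL_KEYWORDS):
    """Count whole-word, case-insensitive keyword hits in job posting text."""
    lowered = text.lower()
    counts = Counter()
    for kw in keywords:
        hits = len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
        if hits:
            counts[kw] = hits
    return counts


postings = """
Senior Splunk SIEM Engineer with experience tuning correlation searches
Palo Alto Prisma Cloud Administrator
Endpoint Detection and Response Analyst (CrowdStrike)
"""

tool_counts = count_tools(postings)
for tool, n in tool_counts.most_common():
    print(n, tool)
```

Matching on word boundaries avoids counting substrings inside unrelated words. A hit is a lead to corroborate, not proof of deployment.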

GitHub organisation enumeration accesses the people listing at https://github.com/orgs/ORGNAME/people (public members only), providing GitHub usernames that can be further pivoted to email addresses (from git commit history), personal repositories, and linked social accounts.
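Commit history usually mixes corporate addresses, personal addresses, and GitHub's privacy-preserving noreply addresses, so it helps to bucket them before pivoting. The sketch below is one possible way to do that. The freemail list, the bucket names, and the sample addresses are assumptions made for illustration.

```python
from collections import defaultdict

# Small illustrative set of consumer mail domains; not exhaustive.
FREEMAIL = {"gmail.com", "outlook.com", "hotmail.com", "yahoo.com",
            "proton.me", "protonmail.com"}


def classify_emails(emails, corporate_domain):
    """Group commit emails into noreply, corporate, personal, and other."""
    buckets = defaultdict(set)
    for raw in emails:
        email = raw.strip().lower()
        if "@" not in email:
            continue
        domain = email.rsplit("@", 1)[1]
        if domain.endswith("users.noreply.github.com"):
            buckets["noreply"].add(email)
        elif domain == corporate_domain or domain.endswith("." + corporate_domain):
            buckets["corporate"].add(email)
        elif domain in FREEMAIL:
            buckets["personal"].add(email)
        else:
            buckets["other"].add(email)
    return {name: sorted(values) for name, values in buckets.items()}


# Hypothetical addresses for demonstration only.
found = [
    "s.mitchell@targetcorp.com",
    "S.Mitchell@targetcorp.com",
    "12345+smitchell@users.noreply.github.com",
    "sarah.m.dev@gmail.com",
    "build@ci.targetcorp.com",
]
classified = classify_emails(found, "targetcorp.com")
print(classified)
```

Noreply addresses carry no mailbox but still confirm the GitHub username, which is why they are kept in a separate bucket instead of being discarded.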

Technical Deep-Dive

# Step 1: Scrape /team, /about, /leadership pages for names and roles
# (matches only bare <h2>/<h3> tags without attributes)
wget -q -O- "https://www.targetcorp.com/team" \
  | grep -oP '(?<=<h[23]>)[^<]+(?=</h[23]>)' | head -30
# Or use a more robust scraper:
curl -s "https://www.targetcorp.com/about" \
  | python3 -c "
import sys; from html.parser import HTMLParser
class P(HTMLParser):
    def handle_data(self, d):
        if d.strip(): print(d.strip())
P().feed(sys.stdin.read())
" | grep -E '^[A-Z][a-z]+ [A-Z][a-z]+' | head -20

# Step 2: PDF metadata extraction
# Download PDFs linked within two levels of the homepage
wget -q -r -l 2 -A pdf -P /tmp/corppdfs/ "https://www.targetcorp.com/" 2>/dev/null
# Extract the selected metadata fields from every downloaded PDF
find /tmp/corppdfs/ -name "*.pdf" -print0 | xargs -0 -I{} exiftool -Author -Creator \
  -Producer -Software -CreateDate -ModifyDate "{}"

# Step 3: GitHub org member enumeration
ORG="targetcorp"
# Member listing (returns public members only, unless the token belongs to an org member)
curl -s "https://api.github.com/orgs/${ORG}/members?per_page=100" \
  -H "Authorization: Bearer $GH_TOKEN" \
  | jq -r '.[].login' > github_members.txt

# For each member, extract commit emails from public repos
mkdir -p /tmp/ghrepos
while read -r username; do
  repos=$(curl -s "https://api.github.com/users/${username}/repos?per_page=30" \
    -H "Authorization: Bearer $GH_TOKEN" | jq -r '.[].clone_url')
  for repo in $repos; do
    dest="/tmp/ghrepos/${username}_$(basename "$repo" .git)"
    git clone --quiet "$repo" "$dest" 2>/dev/null
    git -C "$dest" log --format='%ae' --all 2>/dev/null | grep "@"
  done
done < github_members.txt | sort -u   # de-duplicate across all repos

# Step 4: Reverse image search (Yandex has no public API, so submission is manual)
# Download staff photos:
curl -s "https://www.targetcorp.com/team" | grep -oP 'src="[^"]+\.(jpg|jpeg|png)"' \
  | sed 's/src="//;s/"//' | while read -r imgpath; do
  # Assumes site-relative paths (e.g. /img/staff.jpg); absolute URLs need separate handling
  wget -q "https://www.targetcorp.com${imgpath}" -P /tmp/staffphotos/
done
# Submit each to https://yandex.com/images/search?rpt=imageview (manual)

# Step 5: Job posting analysis for technology stack intelligence
# Scrape job postings (or use Google dork: site:targetcorp.com/careers)
curl -s "https://www.targetcorp.com/careers" \
  | grep -oiE "(splunk|crowdstrike|palo alto|fortinet|qualys|tenable|sentinelone|azure|aws|gcp|kubernetes|terraform|ansible|puppet|chef|jenkins|github actions)" \
  | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

# Step 6: Crunchbase/ZoomInfo for sales team and executive discovery
# Crunchbase (free tier): https://www.crunchbase.com/organization/targetcorp/people
# ZoomInfo (requires account): search by company name, export contacts

# Step 7: Email format inference and validation
# Collect all confirmed names from steps above
# Query Hunter.io for the domain's most common address pattern and three sample addresses:
curl -s "https://api.hunter.io/v2/domain-search?domain=targetcorp.com&api_key=$HUNTER_KEY" \
  | jq '{pattern: .data.pattern, sample_emails: [.data.emails[:3][].value]}'

# Sample exiftool output from annual_report_2024.pdf:
# File Name         : annual_report_2024.pdf
# Author            : Sarah Mitchell
# Creator           : Microsoft Word for Microsoft 365
# Producer          : Microsoft: Print To PDF
# Create Date       : 2024:03:15 09:42:11+00:00
# Modify Date       : 2024:03:22 14:17:55+00:00
# Software          : Microsoft Office Word 2019
# Intelligence: Author name + role inference + software version + working hours TZ
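
The timestamp fields in output like the sample above can be tallied by UTC offset to support timezone inference. This is a minimal sketch, assuming dates in exiftool's default `YYYY:MM:DD HH:MM:SS+HH:MM` form. The function names and the third and fourth sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def parse_exif_date(value):
    """Parse an exiftool-style timestamp; return None if no UTC offset is present."""
    try:
        return datetime.strptime(value.strip(), "%Y:%m:%d %H:%M:%S%z")
    except ValueError:
        return None


def offset_histogram(values):
    """Count how many timestamps carry each UTC offset."""
    counts = Counter()
    for value in values:
        parsed = parse_exif_date(value)
        if parsed is not None:
            counts[parsed.strftime("%z")] += 1
    return counts


timestamps = [
    "2024:03:15 09:42:11+00:00",
    "2024:03:22 14:17:55+00:00",
    "2024:05:02 10:05:00+05:30",   # hypothetical extra document
    "2024:05:02 10:05:00",         # no offset recorded: skipped
]
offsets = offset_histogram(timestamps)
print(offsets.most_common())
```

Timestamps without an offset are skipped instead of guessed, since treating them as UTC or local time would skew the tally.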

Intelligence Collection Methodology

Map all pages of the corporate website that list employees: /team, /about, /about-us, /leadership, /board, /our-people. Use recon-ng's recon/domains-hosts/google_site_web module to enumerate additional hostnames from Google's index, then check each host for the same paths.
For each identified name, record: full name, job title, department, photograph URL, any social media links in their profile. Build a structured spreadsheet with one row per person.
Download all staff photographs. Submit each to Yandex reverse image search and Google Images reverse search. Record any matches on LinkedIn, conference sites, GitHub profiles, or personal sites.
Download all PDFs linked from the website using wget recursive crawl (-A pdf -l 2). Run exiftool on every PDF. Extract Author, Creator, Software, CreateDate fields. Consolidate author names into the staff spreadsheet.
Visit the careers/jobs page. Copy the full text of every active job posting into a text file. Run a keyword frequency analysis for security tools, cloud platforms, compliance frameworks, and internal systems. This produces the technology inventory.
Enumerate the GitHub organisation: list public members, clone their public repositories within the org, extract commit author emails with git log --format="%ae" --all | sort -u. Cross-reference discovered emails with Hunter.io.
Search Crunchbase and LinkedIn for the sales team (SDRs, AEs, CSMs frequently publish personal contact details to facilitate prospects reaching them). Sales staff are often the least security-aware employees and the richest source of internal email format examples.
Apply the confirmed email format to all names in the staff spreadsheet. Validate each address with Hunter.io's /email-verifier endpoint. Flag all addresses with score > 80 as high-confidence.
Submit all high-confidence addresses to HaveIBeenPwned. Tag any with breach hits and cross-reference with the technology inventory — a SIEM administrator with a breached credential is a critical finding.
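
The consolidation step in the methodology above can be sketched as a small merge that keeps one row per person and records which source produced each name. The field names, function names, and sample rows here are assumptions for illustration; a real sheet would carry more columns such as photograph URL and social links.

```python
import csv
import io


def normalise(name):
    """Lower-case and collapse whitespace so name variants share one key."""
    return " ".join(name.lower().split())


def merge_staff(website_rows, pdf_authors):
    """Merge website listings with PDF authors, one record per person."""
    staff = {}
    for row in website_rows:
        staff[normalise(row["name"])] = {
            "name": row["name"],
            "title": row.get("title", ""),
            "sources": ["website"],
            "on_live_listing": True,
        }
    for author in pdf_authors:
        key = normalise(author)
        if key in staff:
            if "pdf_metadata" not in staff[key]["sources"]:
                staff[key]["sources"].append("pdf_metadata")
        else:
            # Seen only in metadata: may be a former employee or a template author.
            staff[key] = {"name": author, "title": "",
                          "sources": ["pdf_metadata"], "on_live_listing": False}
    return list(staff.values())


def to_csv(rows):
    """Serialise merged records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["name", "title", "sources", "on_live_listing"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "sources": ";".join(row["sources"])})
    return buf.getvalue()


# Hypothetical inputs for demonstration only.
website = [{"name": "Sarah Mitchell", "title": "Head of Communications"}]
pdf_names = ["sarah  mitchell", "J. Doe"]
merged = merge_staff(website, pdf_names)
print(to_csv(merged))
```

Keeping an `on_live_listing` flag makes it easy to separate names that still need verification from those confirmed on the current site.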

Common Intelligence Collection Errors

Missing employees who suppress their LinkedIn visibility: Not all staff appear on LinkedIn or on the company website. GitHub commit emails, conference speaker bios, academic publications, and press quote attribution can reveal employees who have no visible social media presence.
Not checking the Wayback Machine for removed team pages: Organisations sometimes remove team pages or specific employee listings after staff departures or security incidents. The Internet Archive's Wayback Machine frequently retains earlier versions with full staff listings that are no longer live.
Assuming all PDF authors are current employees: Document templates are passed between employees, and the Author field usually reflects whoever originally created the document or template, not whoever last saved it. Always cross-reference author names against the live staff listing before treating a PDF author as a current employee.
Ignoring document modification timestamps for timezone inference: PDF CreateDate and ModifyDate fields are usually written in the authoring machine's local time with a UTC offset, although some producers write UTC or omit the offset. A series of documents consistently modified between 09:00 and 18:00 at UTC+5:30 strongly indicates an India-based office, which is useful context for org structure inference.
Using job posting analysis without noting posting dates: A job posting for a "Splunk Engineer" dated two years ago may reflect an already-filled role and a mature Splunk deployment. A recent posting for the same role may indicate a new deployment, unfilled gap, or recent departure — quite different intelligence values.
Not correlating GitHub usernames with personal email domains: Developers often commit with personal email addresses (Gmail, Outlook) rather than corporate addresses. These personal addresses can be checked against HIBP and searched across social platforms using holehe to expand the identity profile beyond the corporate footprint.
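
For the Wayback Machine check listed above, the Internet Archive's CDX endpoint can list captures of a team page. The sketch below only builds the query URL and converts rows of the JSON response into snapshot links, leaving the HTTP request to the reader. The function names and the sample rows are hypothetical.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(page_url):
    """Build a CDX query for successful, content-distinct captures of a page."""
    params = {
        "url": page_url,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "digest",   # skip captures whose content did not change
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Turn CDX JSON rows (header row first) into Wayback snapshot URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts_idx = header.index("timestamp")
    orig_idx = header.index("original")
    return ["https://web.archive.org/web/{}/{}".format(r[ts_idx], r[orig_idx])
            for r in rows]


# Hypothetical response rows in the CDX JSON layout.
rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,targetcorp)/team", "20220110120000", "https://www.targetcorp.com/team",
     "text/html", "200", "AAAA", "5120"],
]
query = build_cdx_url("targetcorp.com/team")
links = snapshot_urls(rows)
print(query)
print(links)
```

The query URL could then be fetched with `urllib.request.urlopen` and decoded with `json.load` before passing the rows to `snapshot_urls`.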

NICE Framework Alignment

Code	Knowledge/Skill/Task Statement	How This Card Develops It
K0058	Knowledge of network protocols	Using HTTP, DNS, and GitHub API endpoints to systematically collect employee intelligence from public web sources
K0145	Knowledge of security assessment approaches	Applying a structured multi-source collection methodology: web scraping, metadata extraction, image search, and repository enumeration
K0272	Knowledge of network security architecture	Inferring the organisation's security tool stack from job postings and correlating with network service exposure data
K0427	Knowledge of encryption algorithms	Interpreting software version information in PDF metadata to assess the organisation's patch currency and application stack
S0040	Skill in identifying and extracting data of interest	Extracting author identities from PDF metadata, employee names from web pages, and technology signals from job posting language
T0569	Apply and utilize authorized cyber capabilities to achieve objectives	Executing systematic employee OSINT from corporate web presence as part of an authorised social engineering preparation phase