OSINT Methodologies
This page describes the standard operating procedures for collecting, verifying, annotating, and archiving open-source intelligence for ingestion into HAP.
Purpose
Provide a repeatable, auditable, and lightweight set of procedures so contributors (human or AI) collect usable, verifiable material.
1. Collection
- Scope: Only public, legally obtainable sources. No hacking, credential abuse, or terms-of-service violations.
- Channels: Media outlets, government releases, think-tank reports, social media (verified accounts first), public datasets (ACLED, UN, IEA), geospatial imagery (public providers), web archives.
- Ingestion methods:
- RSS / Atom pulls (preferred for stability)
- API retrieval (where available)
- Periodic scraper jobs (robots.txt and ToS-aware)
- Manual entry for ephemeral or high-value items (press conferences, leaks)
- Minimum metadata captured on collection:
- Title, URL, publisher, author (if present)
- Publication date & ingest timestamp
- Collector (bot id or human handle)
- Confidence flag (Auto, Human-verified)
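The collection rules above can be sketched in Python as a robots.txt check plus a record type for the minimum metadata. This is an illustrative sketch, not HAP's actual schema: the names allowed_by_robots and CollectionRecord, the field names, and the choice of UTC for the ingest timestamp are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical constant: the two confidence flags named in this section.
CONFIDENCE_FLAGS = ("Auto", "Human-verified")


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


@dataclass
class CollectionRecord:
    """Minimum metadata captured on collection (field names are assumed)."""
    title: str
    url: str
    publisher: str
    published_at: datetime
    collector: str                      # bot id or human handle
    author: Optional[str] = None        # only if present in the source
    confidence_flag: str = "Auto"
    ingested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        # Reject flags other than the two listed in the SOP.
        if self.confidence_flag not in CONFIDENCE_FLAGS:
            raise ValueError(f"unknown confidence flag: {self.confidence_flag}")
```

A scraper job would call allowed_by_robots before each fetch and build a CollectionRecord only for pages it was permitted to retrieve. ToS compliance cannot be checked this way and still needs human review.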
2. Verification & Corroboration
- Two-source corroboration rule: For a claim to be elevated to Confirmed in HAP, cite at least two independent sources (or one trusted primary + one corroborator).
- Source reliability scoring:
- Trusted (long track record, transparent sourcing)
- Medium (reputable but occasional errors)
- Low (unverified, new or partisan outlets)
- Validation steps:
- Check for primary documentation (official release, transcript, dataset).
- Cross-check timelines and quotes.
- Run a quick metadata check (whois, publication history, social account creation dates for authors).
- If geolocation or imagery is used, perform a reverse check (shadows, landmarks, timestamps).
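One possible reading of the two-source corroboration rule is sketched below. The Source type, the origin field (the outlet whose reporting an item ultimately relies on), and the "Unconfirmed" label are assumptions introduced for illustration. This page defines only the Confirmed status and the three reliability levels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    publisher: str
    reliability: str               # "Trusted", "Medium", or "Low"
    is_primary: bool = False       # official release, transcript, dataset
    origin: Optional[str] = None   # assumed field: whose reporting this relies on

    @property
    def effective_origin(self) -> str:
        # A source with no stated origin is treated as its own origin.
        return self.origin or self.publisher


def claim_status(sources: list[Source]) -> str:
    """Apply the corroboration rule under the assumptions stated above."""
    # Path 1: at least two sources that do not share an origin.
    if len({s.effective_origin for s in sources}) >= 2:
        return "Confirmed"
    # Path 2: one trusted primary plus a corroborator from another publisher,
    # even if that corroborator draws on the primary itself.
    for primary in sources:
        if primary.is_primary and primary.reliability == "Trusted":
            if any(s.publisher != primary.publisher for s in sources):
                return "Confirmed"
    return "Unconfirmed"  # hypothetical label for anything not yet Confirmed
```

Under this reading, two outlets that both rewrite the same wire story do not confirm a claim, while a ministry release plus a newspaper report on that release does.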
3. Annotation & Tagging
- Required fields: Date, Source, Type, Short summary (1–3 lines), Tags, Archive link.
- Controlled vocabulary (initial):
- Policy divergence, EU pushback, US alignment with Russia, Diplomatic tension, Early indicator, Strategic risk, Collective stance, Sanctions change, Cyber incident, Resource shock
- Notes field: Analyst annotations; include reason for confidence score.
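A minimal validation sketch for the annotation rules above, assuming each entry is a plain dict. The key names (date, source, type, summary, tags, archive_link) and the function name validate_annotation are hypothetical; the vocabulary is copied from the initial list on this page.

```python
REQUIRED_FIELDS = ("date", "source", "type", "summary", "tags", "archive_link")

CONTROLLED_VOCABULARY = {
    "Policy divergence", "EU pushback", "US alignment with Russia",
    "Diplomatic tension", "Early indicator", "Strategic risk",
    "Collective stance", "Sanctions change", "Cyber incident",
    "Resource shock",
}


def validate_annotation(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not entry.get(name):
            problems.append(f"missing required field: {name}")
    # The short summary is limited to 1-3 lines.
    summary = entry.get("summary") or ""
    if len(summary.splitlines()) > 3:
        problems.append("summary longer than 3 lines")
    # Every tag must come from the controlled vocabulary.
    for tag in entry.get("tags") or []:
        if tag not in CONTROLLED_VOCABULARY:
            problems.append(f"tag not in controlled vocabulary: {tag}")
    return problems
```

Returning a list of problems rather than raising on the first one lets an analyst fix every issue in a single pass.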
4. Archiving
- Archive formats: PDF snapshot (preferred), HTML archive, or screenshot when PDF unavailable.
- Storage & redundancy: Store each archive in at least two independent locations (primary server + object storage or encrypted external drive). Include SHA256 checksum in metadata.
- Retention policy: Keep raw archives indefinitely for high-SSS items; for low-value items, keep for 2 years unless later promoted.
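The storage and checksum rule above could look like the following sketch. The function names are hypothetical, and local directories stand in for the two independent locations; a real deployment would replace one of them with an object-store upload.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_with_redundancy(snapshot: Path, locations: list[Path]) -> dict:
    """Copy a snapshot to every location and verify each copy's checksum."""
    if len(locations) < 2:
        raise ValueError("at least two storage locations are required")
    checksum = sha256_of(snapshot)
    copies = []
    for directory in locations:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / snapshot.name
        shutil.copy2(snapshot, target)
        # Re-hash the copy so silent corruption is caught at write time.
        if sha256_of(target) != checksum:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(str(target))
    # The returned dict is meant to be merged into the item's metadata.
    return {"sha256": checksum, "copies": copies}
```

Re-running sha256_of on a stored copy at a later date and comparing it with the recorded value gives a simple integrity audit.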
5. Workflow (daily)
- Automated collectors run on a scheduled cadence (hourly / 6-hour / daily, depending on the feed).
- Bot harvests push to ingest_queue with metadata.
- Human analyst triages high-priority items, assigns reliability scores, and tags.
- Items flagged High or Anomaly are escalated to HAP for review and possible inclusion in trackers.
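The escalation step in the daily workflow can be sketched as a simple queue drain. This assumes the ingest queue holds dicts with a flag key; that key name, the in-memory deque, and the triage function are illustrative choices, not the HAP implementation.

```python
from collections import deque

# The two flags this workflow escalates to HAP.
ESCALATION_FLAGS = {"High", "Anomaly"}


def triage(ingest_queue: deque) -> tuple[list, list]:
    """Drain the queue, splitting items into (escalated, retained)."""
    escalated, retained = [], []
    while ingest_queue:
        item = ingest_queue.popleft()   # oldest item first
        if item.get("flag") in ESCALATION_FLAGS:
            escalated.append(item)
        else:
            retained.append(item)
    return escalated, retained
```

Items in the retained list stay in the normal archive flow and can be re-flagged if an analyst promotes them later.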
6. Tooling (starter list)
- RSS/Feed readers: self-hosted aggregator (e.g., Miniflux) or RSS-pull scripts
- Archival: wget --mirror / curl + PDF printer / webrecorder tools
- Storage: PostgreSQL for metadata, object store for binary archives
- Analysis: Python (pandas + BeautifulSoup), Hugging Face lightweight models for NER/sentiment
- Visualization: Streamlit / Gradio prototypes for dashboards
7. Ethics & Legal
- Follow jurisdictional privacy laws and ToS.
- Avoid doxxing private individuals; redact PII unless it is publicly relevant and legally permissible.
- Document all collection decisions for audit.