OSINT Methodologies

This page describes the standard operating procedures for collecting, verifying, annotating, and archiving open-source intelligence for ingestion into HAP.

Purpose

Provide a repeatable, auditable, and lightweight set of procedures so contributors (human or AI) collect usable, verifiable material.

1. Collection

  • Scope: Only public, legally obtainable sources. No hacking, credential abuse, or terms-of-service violations.
  • Channels: Media outlets, government releases, think-tank reports, social media (verified accounts first), public datasets (ACLED, UN, IEA), geospatial imagery (public providers), web archives.
  • Ingestion methods:
      • RSS / Atom pulls (preferred for stability)
      • API retrieval (where available)
      • Periodic scraper jobs (robots.txt- and ToS-aware)
      • Manual entry for ephemeral or high-value items (press conferences, leaks)
  • Minimum metadata captured on collection:
      • Title, URL, publisher, author (if present)
      • Publication date & ingest timestamp
      • Collector (bot ID or human handle)
      • Confidence flag (Auto, Human-verified)
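The minimum metadata above can be captured as a small structured record. A sketch in Python; the class name, field names, and defaults are illustrative assumptions, not an existing HAP schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical record for the minimum metadata captured on collection.
@dataclass
class CollectedItem:
    title: str
    url: str
    publisher: str
    author: Optional[str]      # may be absent in the source
    published: Optional[str]   # publication date as reported by the source
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    collector: str = "manual"  # bot ID or human handle
    confidence: str = "Auto"   # "Auto" until promoted to "Human-verified"

item = CollectedItem(
    title="Example release",
    url="https://example.org/release",
    publisher="Example Agency",
    author=None,
    published="2024-05-01",
    collector="rss-bot-01",
)
```

Defaulting the confidence flag to "Auto" keeps the Human-verified state an explicit analyst action rather than something a collector can set by accident.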

2. Verification & Corroboration

  • Two-source corroboration rule: For a claim to be elevated to Confirmed in HAP, cite at least two independent sources (or one trusted primary + one corroborator).
  • Source reliability scoring:
      • Trusted (long track record, transparent sourcing)
      • Medium (reputable but occasional errors)
      • Low (unverified, new, or partisan outlets)
  • Validation steps:
      • Check for primary documentation (official release, transcript, dataset).
      • Cross-check timelines and quotes.
      • Run a quick metadata check (whois, publication history, social-account creation dates for authors).
      • If geolocation or imagery is used, perform a reverse check (shadows, landmarks, timestamps).
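The two-source corroboration rule can be expressed as a simple check. A minimal sketch; the source dicts and their field names ("publisher", "reliability", "primary") are assumptions, not an existing HAP interface:

```python
def can_confirm(sources):
    """Return True if a claim can be elevated to Confirmed.

    Requires at least two sources from independent publishers, or one
    trusted primary source plus at least one corroborator.
    """
    publishers = {s["publisher"] for s in sources}
    two_independent = len(publishers) >= 2
    trusted_primary = any(
        s["reliability"] == "Trusted" and s.get("primary") for s in sources
    )
    return two_independent or (trusted_primary and len(sources) >= 2)

claim_sources = [
    {"publisher": "Official Gazette", "reliability": "Trusted", "primary": True},
    {"publisher": "Wire Service", "reliability": "Medium", "primary": False},
]
```

Treating "independent" as "distinct publishers" is a deliberate simplification; syndicated copy from one wire service should still count as a single source.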

3. Annotation & Tagging

  • Required fields: Date, Source, Type, Short summary (1–3 lines), Tags, Archive link.
  • Controlled vocabulary (initial): Policy divergence, EU pushback, US alignment with Russia, Diplomatic tension, Early indicator, Strategic risk, Collective stance, Sanctions change, Cyber incident, Resource shock
  • Notes field: Analyst annotations; include reason for confidence score.
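The required fields and controlled vocabulary above lend themselves to an automated lint before an item is accepted. A sketch; the lower-cased field keys and the dict shape are assumptions:

```python
# Required fields and controlled tags mirror the annotation rules above.
REQUIRED_FIELDS = {"date", "source", "type", "summary", "tags", "archive_link"}

CONTROLLED_TAGS = {
    "Policy divergence", "EU pushback", "US alignment with Russia",
    "Diplomatic tension", "Early indicator", "Strategic risk",
    "Collective stance", "Sanctions change", "Cyber incident", "Resource shock",
}

def validate_annotation(item):
    """Return (missing_fields, unknown_tags); both empty means valid."""
    missing = REQUIRED_FIELDS - item.keys()
    unknown = set(item.get("tags", ())) - CONTROLLED_TAGS
    return missing, unknown

annotation = {
    "date": "2024-05-01",
    "source": "Example Agency",
    "type": "Press release",
    "summary": "Short 1-3 line summary.",
    "tags": ["Early indicator", "Strategic risk"],
    "archive_link": "https://archive.example/abc",
}
```

Rejecting unknown tags at ingest time is what keeps the vocabulary controlled; additions should go through a deliberate vocabulary change, not ad-hoc tagging.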

4. Archiving

  • Archive formats: PDF snapshot (preferred), HTML archive, or screenshot when a PDF is unavailable.
  • Storage & redundancy: Store each archive in at least two independent locations (primary server + object storage or encrypted external drive). Include SHA256 checksum in metadata.
  • Retention policy: Keep raw archives indefinitely for high-SSS items; for low-value items, keep for 2 years unless later promoted.
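The SHA-256 checksum recorded in the archive metadata can be computed with the standard library. A sketch; the chunked read keeps memory use flat even for large PDF snapshots:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Recomputing this digest against both storage locations is a cheap periodic integrity check for the redundancy requirement above.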

5. Workflow (daily)

  1. Automated collectors run at scheduled cadence (hourly / 6-hour / daily depending on feed).
  2. Bots push harvested items to the ingest_queue with metadata.
  3. Human analyst triages high-priority items, assigns reliability scores, and tags.
  4. Items flagged High or Anomaly are escalated to HAP for review and possible inclusion in trackers.
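The four workflow steps above can be sketched as a small pipeline. The queue name matches ingest_queue from step 2; the flag values and function names are illustrative assumptions:

```python
import queue

ingest_queue = queue.Queue()

def harvest_push(metadata):
    """Step 2: a collector pushes harvested metadata onto the queue."""
    ingest_queue.put(metadata)

def triage(metadata, reliability, flag):
    """Step 3: an analyst assigns a reliability score and a priority flag."""
    return {**metadata, "reliability": reliability, "flag": flag}

def escalate(items):
    """Step 4: only High or Anomaly items go to HAP for review."""
    return [i for i in items if i["flag"] in {"High", "Anomaly"}]

harvest_push({"title": "Sanctions update", "url": "https://example.org/a"})
triaged = [triage(ingest_queue.get(), reliability="Trusted", flag="High")]
```

Keeping triage as a pure function (metadata in, enriched record out) makes step 3 easy to audit, which supports the documentation requirement in the ethics guidance.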

6. Tooling (starter list)

  • RSS/Feed readers: self-hosted aggregator (e.g., Miniflux) or RSS-pull scripts
  • Archival: wget --mirror / curl + PDF printer / webrecorder tools
  • Storage: PostgreSQL for metadata, object store for binary archives
  • Analysis: Python (pandas + BeautifulSoup), Hugging Face lightweight models for NER/sentiment
  • Visualization: Streamlit / Gradio prototypes for dashboards
7. Ethics & Legal

  • Follow jurisdictional privacy laws and platform terms of service.
  • Avoid doxxing private individuals; redact PII unless it is publicly relevant and legally permissible.
  • Document all collection decisions for audit.