For AI Companies

Why this matters

The regulatory landscape around AI training data is tightening. The EU AI Act requires documented data governance. The US Copyright Office is actively evaluating training-data liability. Class-action lawsuits from creators are multiplying. Whether or not any single regulation applies to your pipeline today, the direction is clear: AI companies will need to demonstrate that they made reasonable efforts to respect creator consent.

The problem is structural. There is no standardised way for creators to declare whether their work is available for AI training. robots.txt was never designed for this. EXIF metadata gets stripped. Opt-out forms are scattered across hundreds of company-specific portals.

Sourcemark provides a single, machine-readable registry of timestamped consent declarations. Instead of reverse-engineering creator intent from crawled web pages, your pipeline can query a structured API and get a definitive answer.

What the registry provides

Every sourcemarked file has a record in the registry containing:

  • Consent status — either “Not available” (do not use for training) or “Available by licence” (contact the declarer for terms)
  • Timestamp — when the declaration was made, providing a clear temporal record for compliance audits
  • Declarer identity — who made the declaration, if they have chosen to make their profile public
  • Licensing contact pathway — for files marked as available, a route to reach the declarer and negotiate terms
  • Version history — the full chain of declaration changes over time, so you can see what was declared at any point

If a query returns no match, that means no consent declaration has been registered for that file. Absence of a record is not consent. Your compliance process should treat unmatched files as having no structured signal either way.

Important

When multiple declarations exist for the same file, Sourcemark applies most-restrictive aggregation: if any declarer says “not available,” the aggregated status is not available. Licensing availability is combined with OR logic — if any declarer offers a licence, that pathway is surfaced.
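The aggregation rule can be sketched as a small function. This is an illustrative sketch, not Sourcemark’s implementation, and the status strings are assumptions rather than the API’s actual values:

```python
def aggregate(declarations):
    """Combine multiple consent declarations for one file.

    Most-restrictive rule: any "not_available" forces the aggregate
    status to "not_available". Licensing is OR-combined: a licensing
    pathway is surfaced if any declarer offers one.

    The status strings used here are illustrative placeholders.
    """
    status = "available_by_licence"
    if any(d == "not_available" for d in declarations):
        status = "not_available"
    licence_offered = any(d == "available_by_licence" for d in declarations)
    return status, licence_offered

# A single "not available" overrides any licence offer, but the
# licensing pathway is still surfaced alongside the restrictive status.
print(aggregate(["available_by_licence", "not_available"]))
```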

How to integrate

Sourcemark is designed for pipeline integration. The typical pattern is a consent check gate in your data ingestion workflow:

  1. Fingerprint the asset. Generate a SHA-256 hash for exact matching. For images, also generate a perceptual hash to catch crops, resizes, and recompressed variants.
  2. Query the registry. Submit the fingerprint to the Sourcemark API. The response tells you whether a consent declaration exists and what it says.
  3. Route based on status. Not-available files get excluded. Licensed files get routed to your licensing workflow. No-match files proceed through your existing process — but you now have a documented record that you checked.

This check adds milliseconds to your pipeline and gives you an auditable log of consent due diligence for every asset you process.
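The three-step gate might look like the following sketch. Only the routing logic follows the pattern described above; the response shape and field names are assumptions for illustration, not Sourcemark’s actual schema:

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Step 1: exact-match fingerprint (SHA-256 over the raw bytes)."""
    return hashlib.sha256(data).hexdigest()


def route(record):
    """Step 3: route an asset based on the registry response.

    `record` is a hypothetical query result: None means no match;
    otherwise a dict with a "status" field (field name assumed).
    """
    if record is None:
        return "proceed"   # no structured signal; log that the check was made
    if record["status"] == "not_available":
        return "exclude"   # respect the declaration
    return "license"       # hand off to the licensing workflow

asset = b"example image bytes"
digest = fingerprint(asset)  # step 2: submit this to the registry API
decision = route({"status": "not_available"})
```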

API overview

The Sourcemark API supports two query modes:

Single file check — submit one fingerprint and receive the consent status, timestamp, declarer information, and licensing pathway. Suitable for real-time checks during interactive workflows or low-volume ingestion.

Batch queries — submit fingerprints in bulk for dataset-scale auditing. If you are evaluating a training corpus of millions of files, batch mode lets you run consent checks across the entire dataset without per-request overhead.

Both modes accept two fingerprint formats:

  • SHA-256 for exact byte-level matching. If the file is byte-identical to the registered version, this will find the record.
  • Perceptual hash for visual similarity matching. This catches files that have been cropped, resized, recompressed, or lightly edited. The public similarity endpoint is POST /api/public/similarity/phash.
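To make the perceptual-hash idea concrete, here is a difference hash (dHash), one common scheme: downscale the image to a small grayscale grid, then record whether each pixel is brighter than its right neighbour. The sketch below works on an already-scaled 9×8 grid of values; a real pipeline would use an image library, and Sourcemark’s actual perceptual hash algorithm is not specified on this page:

```python
def dhash(grid):
    """Difference hash over a 9x8 grid of grayscale values (0-255).

    Each bit records whether a pixel is brighter than its right
    neighbour, yielding a 64-bit hash that tends to survive resizing
    and recompression. Illustrative only.
    """
    bits = 0
    for row in grid:                           # 8 rows
        for left, right in zip(row, row[1:]):  # 8 comparisons per row
            bits = (bits << 1) | (1 if left > right else 0)
    return f"{bits:016x}"

# A left-to-right brightening gradient: no pixel is brighter than its
# right neighbour, so every bit is zero.
flat = [[c * 10 for c in range(9)] for _ in range(8)]
print(dhash(flat))  # -> "0000000000000000"
```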

All responses are machine-readable JSON. The API is rate-limited on the public tier. Paid API access removes rate limits and enables batch queries.
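A single-file check against the public similarity endpoint might be assembled like this. The endpoint path comes from this page, but the host, payload field name, and response shape are assumptions — confirm them against the API reference before relying on them:

```python
import json
import urllib.request

# Host and field name are assumed for illustration.
SOURCEMARK_PHASH_ENDPOINT = "https://sourcemark.ai/api/public/similarity/phash"


def build_phash_request(phash_hex: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the public similarity endpoint."""
    body = json.dumps({"phash": phash_hex}).encode("utf-8")
    return urllib.request.Request(
        SOURCEMARK_PHASH_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_phash_request("a1b2c3d4e5f60718")
# urllib.request.urlopen(req) would send it; note the public tier is rate-limited.
```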

Consent interpretation guidelines

The registry provides structured records, not legal advice. Here is how to interpret what you get back:

  • Not available — the declarer has explicitly stated this work should not be used for AI training. Respect this declaration.
  • Available by licence — the declarer is open to AI training use under negotiated terms. Use the licensing contact pathway to reach them before including the asset.
  • No match — no declaration exists in the registry for this fingerprint. This is not consent. It means the creator has not yet registered a signal through Sourcemark. Apply your own risk framework.
  • Multiple declarations — aggregated using most-restrictive logic. A single “not available” declaration from any rights holder overrides all “available” declarations for the same asset.

Note

Sourcemark does not guarantee legal compliance. It provides the consent infrastructure that makes compliance tractable. The registry gives your legal and compliance teams structured, timestamped evidence that you checked before you trained.

Getting started

  1. Explore the public endpoint. The public similarity API at POST /api/public/similarity/phash is unauthenticated and rate-limited. Use it to evaluate the registry against a sample of your dataset.
  2. Contact us for API access. Paid tiers unlock batch queries, higher rate limits, and SHA-256 exact matching. Reach out through sourcemark.ai to discuss your pipeline requirements.
  3. Integrate into your ingestion workflow. Add the consent check as a gate in your data pipeline. Log every query and response for your compliance records.
  4. Build reporting. Use query results to generate consent coverage reports for your training datasets — what percentage has been checked, what percentage has explicit declarations, and what the breakdown of statuses looks like.
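The reporting step reduces to a simple tally over logged query results. A minimal sketch, assuming each checked asset is recorded as one of three status strings (names illustrative):

```python
from collections import Counter


def coverage_report(statuses):
    """Summarise consent-check results for a dataset.

    `statuses` holds one entry per checked asset: "not_available",
    "available_by_licence", or "no_match" (illustrative names).
    Returns per-status counts plus the share of assets that carry
    an explicit declaration.
    """
    counts = Counter(statuses)
    total = len(statuses)
    declared = counts["not_available"] + counts["available_by_licence"]
    return {
        "total_checked": total,
        "counts": dict(counts),
        "declared_share": declared / total if total else 0.0,
    }

report = coverage_report(
    ["no_match", "no_match", "not_available", "available_by_licence"]
)
# Half of this small sample carries an explicit declaration.
```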

The registry is live. Creators are sourcemarking their work today. The sooner you integrate, the stronger your due diligence position becomes.


Have questions? Check the FAQ or learn what Sourcemark is.