Internet Archive CLI Skill

This skill enables interaction with the Internet Archive (archive.org) using the ia command-line tool from the internetarchive Python package.

Items

An item is the fundamental unit on archive.org - a logical grouping of related files sharing common metadata. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Each item has a unique identifier across the entire archive.

Every item contains:

Original uploaded files
Derivative files (automatically generated by archive.org)
<identifier>_meta.xml - item-level metadata
<identifier>_files.xml - file-level metadata

Items must belong to a collection.

Item Limits

Constraint	Recommended	Hard Limit
Item total size	Under 100GB	~1TB
Files per item	Under 10,000	250,000 (performance degrades >10,000)
Single file size	Under 50GB	500-700GB
Daily upload	Under 1,000 files	5,000 files (zips count as 1)
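
The recommended ceilings in the table can be checked before an upload is attempted. The following is a minimal sketch, not part of the ia tool: the function name check_item_plan is hypothetical, and it assumes the table's "GB" means decimal gigabytes (10^9 bytes).

```python
GB = 10**9  # assumption: the table's "GB" means decimal gigabytes

# Recommended ceilings taken from the Item Limits table.
MAX_ITEM_BYTES = 100 * GB
MAX_FILES = 10_000
MAX_FILE_BYTES = 50 * GB


def check_item_plan(file_sizes):
    """Return warnings for a planned item; file_sizes maps file name -> bytes."""
    warnings = []
    total = sum(file_sizes.values())
    if total >= MAX_ITEM_BYTES:
        warnings.append(f"item total {total / GB:.1f} GB is not under 100 GB")
    if len(file_sizes) >= MAX_FILES:
        warnings.append(f"{len(file_sizes)} files is not under 10,000")
    for name, size in sorted(file_sizes.items()):
        if size >= MAX_FILE_BYTES:
            warnings.append(f"{name} is {size / GB:.1f} GB, not under 50 GB")
    return warnings


# Hypothetical plan: one oversized file next to a small one.
print(check_item_plan({"scan.tar": 60 * GB, "notes.txt": 2_000}))
```

An empty list means the plan stays within the recommended values; the hard limits are deliberately not encoded here because the table gives them only approximately.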

Permanent URL patterns:

Details page: https://archive.org/details/<identifier>
Download directory: https://archive.org/download/<identifier>
Specific file: https://archive.org/download/<identifier>/<filename>
Item history: https://archive.org/history/<identifier>

Warning: Never link to server-specific URLs like ia802304.us.archive.org - these break when items migrate between servers. Always use the canonical archive.org URLs above.
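
As an illustration of the URL patterns above, the sketch below builds the canonical links from an identifier. The helper name item_urls is hypothetical; the only behavior added beyond the patterns is percent-encoding of the file name.

```python
from urllib.parse import quote

BASE = "https://archive.org"


def item_urls(identifier, filename=None):
    """Return canonical archive.org URLs for an item and, optionally, one file."""
    urls = {
        "details": f"{BASE}/details/{identifier}",
        "download": f"{BASE}/download/{identifier}",
        "history": f"{BASE}/history/{identifier}",
    }
    if filename is not None:
        # Percent-encode spaces and other unsafe characters, but keep "/"
        # so files inside subdirectories still resolve.
        urls["file"] = f"{BASE}/download/{identifier}/{quote(filename)}"
    return urls


print(item_urls("TripDown1905")["details"])
```

Because every URL is derived from the identifier alone, links built this way never point at a server-specific host.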

For more details, see: https://archive.org/developers/items.html

Derivatives

When you upload files to the Internet Archive, the system automatically generates derivative files - converted versions in different formats and resolutions. For example:

Video: Transcoded to H.264 and Ogg at various bitrates
Audio: Converted to MP3 (multiple bitrates), Ogg Vorbis, FLAC
Text/Books: OCR processing, searchable PDFs, EPUB, DjVu
Images: Thumbnails, JPEG 2000, different resolutions

Derivatives make content accessible across different devices and bandwidths. You can identify derivatives in ia list output - they have an original field pointing to their source file.
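
The original field makes it possible to group derivatives under the file they were generated from. The sketch below assumes file records shaped like dictionaries with name, source, and original keys; the sample records and the function name group_derivatives are hypothetical.

```python
from collections import defaultdict


def group_derivatives(files):
    """Map each source file name to the names of derivatives built from it."""
    groups = defaultdict(list)
    for entry in files:
        original = entry.get("original")
        if original:  # only derivatives carry an "original" field
            groups[original].append(entry["name"])
    return dict(groups)


# Hypothetical file records for illustration only.
sample = [
    {"name": "film.mpeg", "source": "original"},
    {"name": "film.mp4", "source": "derivative", "original": "film.mpeg"},
    {"name": "film.ogv", "source": "derivative", "original": "film.mpeg"},
]
print(group_derivatives(sample))
```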

To skip derivative generation during upload, use --no-derive:

ia upload my-item file.mp4 --metadata="mediatype:movies" --no-derive

For the complete list of source formats and their generated derivatives, see: https://archive.org/help/derivatives.php

Metadata Schema

Internet Archive items use XML-based metadata. Key points:

Required fields: identifier, mediatype
Recommended fields: title, description, creator, date, subject, collection, language
Repeatable fields: collection, creator, subject, and language accept multiple values
Custom fields: You can define unlimited custom metadata fields (must follow XML naming rules)

Identifier requirements:

ASCII alphanumeric, underscores, dashes, or periods only
Must begin with alphanumeric character
5-100 characters (5-80 recommended)
Unique and unchangeable once set
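
The character and length rules above can be expressed as a single regular expression. This is a local pre-check sketch only (the names are hypothetical); it cannot tell whether an identifier is already taken on archive.org.

```python
import re

# First character alphanumeric, then 4 to 99 more characters drawn from
# ASCII letters, digits, underscore, dash, or period (5-100 in total).
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{4,99}")


def is_valid_identifier(identifier):
    """True if the identifier satisfies the character and length rules."""
    return IDENTIFIER_RE.fullmatch(identifier) is not None


def is_recommended_length(identifier):
    """True if the identifier also stays within the recommended 5-80 characters."""
    return is_valid_identifier(identifier) and len(identifier) <= 80


for candidate in ("my-document-2024", "_bad-start", "abcd"):
    print(candidate, is_valid_identifier(candidate))
```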

For the complete metadata schema reference, see: https://archive.org/developers/metadata-schema

Collections

Collections group related items together. Key points:

Only IA staff can create collections - users must request creation
Minimum 50 items required for a new collection
Items must be related and typically same media type
Collection creation takes up to two weeks after request

To request a collection, contact Internet Archive with:

List of item identifiers or search query identifying items
Desired collection identifier (5-80 chars, alphanumeric only)
Collection title and description
At least one subject tag

Public upload collections (anyone can upload to):

opensource_movies, opensource_audio, opensource_media - general media
community_texts, community_video, community_audio - community contributions

Other collections restrict uploads to designated uploaders only.

Tool Detection and Installation

Before using any ia commands, check if the tool is installed:

ia --version

If the ia command is not found, install it using uv:

uv tool install internetarchive

Alternative installation methods:

pipx install internetarchive
pip install internetarchive

After installation, verify it works with ia --version.
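
The detection and installation steps above can be scripted. The sketch below only decides which install command to run; it does not run it. The function name plan_install is hypothetical, and the which parameter exists so the logic can be exercised without touching the real PATH.

```python
import shutil

# Installers in the order the text prefers them, with the command each runs.
INSTALLERS = [
    ("uv", ["uv", "tool", "install", "internetarchive"]),
    ("pipx", ["pipx", "install", "internetarchive"]),
    ("pip", ["pip", "install", "internetarchive"]),
]


def plan_install(which=shutil.which):
    """Return None if `ia` is on PATH, else the first usable install command."""
    if which("ia"):
        return None
    for tool, command in INSTALLERS:
        if which(tool):
            return command
    raise RuntimeError("no supported installer (uv, pipx, pip) found on PATH")
```

A caller would pass the returned list to subprocess.run and then confirm the result with ia --version.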

Global Options

These options work with all ia commands:

Option	Description
`-h, --help`	Show help message
`-v, --version`	Display version
`-c FILE, --config-file`	Path to config file
`-l, --log`	Enable logging
`-d, --debug`	Enable debug output

Configuration and Authentication

Check if ia is configured:

ia configure --whoami

If ia is not configured (the command prints an error or nothing), the user needs to set up credentials:

Interactive setup: Run ia configure and follow prompts
Get credentials: IA-S3 keys from https://archive.org/account/s3.php
Config location: Saves to ~/.config/internetarchive/ia.ini (older versions used ~/.config/ia.ini)

Configure Options

Option	Description
`--whoami`	Print current authenticated user
`--show`	Print current config as JSON
`--check`	Validate IA-S3 keys (exit 0 if valid, 1 otherwise)

# Show current config
ia configure --show

# Validate keys (useful in scripts)
ia configure --check && echo "Keys valid"

Environment Variables

Alternative to config file:

export IA_ACCESS_KEY_ID="your-access-key"
export IA_SECRET_ACCESS_KEY="your-secret-key"

Note: Configuration is required for uploads and metadata modifications. Searching and downloading public items works without authentication.

User-Agent Identification (Required)

All requests to the Internet Archive must include a proper User-Agent string that clearly identifies the source of the request. This applies to every request made via any tool - the ia CLI, Python library, direct API calls, curl, or any other HTTP client. This is critical for AI agents, bots, and automated tools.

The ia CLI automatically includes a default User-Agent with your access key:

internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0

When using Claude Code or other AI/LLM agents, you must append a custom suffix that includes:

The tool/agent name and version (e.g., "Claude Code/1.0.0")
The model being used if applicable (e.g., "claude-sonnet-4-20250514")
Any relevant context about the automation

The --user-agent-suffix CLI option and user_agent_suffix config setting require internetarchive version 5.7.2 or newer. The default User-Agent (including access key) is always sent - your suffix is appended to it.

CLI:

ia --user-agent-suffix "Claude Code/1.0.0 (claude-sonnet-4-20250514)" download my-item

INI file (~/.config/internetarchive/ia.ini):

[general]
user_agent_suffix = Claude Code/1.0.0 (claude-sonnet-4-20250514)

Python API:

from internetarchive import get_session

session = get_session(config={
    'general': {'user_agent_suffix': 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'}
})

The resulting User-Agent will look like:

internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0 Claude Code/1.0.0 (claude-sonnet-4-20250514)

This helps the Internet Archive track usage patterns, troubleshoot issues, and maintain service quality. Always be specific - include version numbers, model identifiers, and enough detail to distinguish your tool from others.
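
To keep the suffix consistent across invocations, it can be assembled in one place. This is a sketch with hypothetical helper names; joining the model and extra context with "; " inside the parentheses is an assumption of this example, not a format required by the Internet Archive.

```python
def build_user_agent_suffix(tool, version, model=None, context=None):
    """Compose a suffix such as 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'."""
    suffix = f"{tool}/{version}"
    details = [part for part in (model, context) if part]
    if details:
        # Assumption: extra details are joined with "; " inside parentheses.
        suffix += " (" + "; ".join(details) + ")"
    return suffix


def ia_command(suffix, *args):
    """Prefix an ia invocation with the global --user-agent-suffix option."""
    return ["ia", "--user-agent-suffix", suffix, *args]


suffix = build_user_agent_suffix("Claude Code", "1.0.0", "claude-sonnet-4-20250514")
print(ia_command(suffix, "download", "my-item"))
```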

Search Operations

Search the Internet Archive catalog:

ia search '<query>'

Search Parameters

Parameter	Description
`--itemlist`	Output identifiers only, one per line
`-n, --num-found`	Print only the count of results
`-s, --sort`	Sort results: `--sort='field desc'` or `--sort='field asc'`
`-f, --field`	Return specific metadata fields (repeatable)
`-F, --fts`	Full-text search (search within text content, not just metadata)
`--parameters`	Raw query parameters: `--parameters="page=N&rows=N"`

# Get result count only
ia search 'collection:nasa' -n

# Sort by date descending
ia search 'mediatype:texts' --sort='date desc'

# Return specific fields
ia search 'collection:nasa' --field=identifier --field=title
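
When ia search is driven from a script, building the argument list explicitly avoids shell-quoting problems with queries that contain spaces or quotes. The sketch below uses only the options from the table above; the function name and keyword names are hypothetical.

```python
def search_command(query, *, itemlist=False, count_only=False, sort=None,
                   fields=(), full_text=False, rows=None):
    """Build an argument list for `ia search`, suitable for subprocess.run."""
    cmd = ["ia", "search", query]
    if itemlist:
        cmd.append("--itemlist")
    if count_only:
        cmd.append("--num-found")
    if sort:
        cmd.append(f"--sort={sort}")
    for field in fields:
        cmd.append(f"--field={field}")  # --field is repeatable
    if full_text:
        cmd.append("--fts")
    if rows is not None:
        cmd.append(f"--parameters=rows={rows}")
    return cmd


print(search_command("collection:nasa", fields=["identifier", "title"]))
```

Passing the list to subprocess.run without shell=True hands the query to ia as a single argument, so no extra quoting is needed.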

Sort Fields

Common sort fields for use with --sort:

Field	Description
`date`	Content date
`publicdate`	When item was published to archive.org
`addeddate`	When added to archive
`updatedate`	Last updated
`title` / `titleSorter`	Alphabetical by title
`creator` / `creatorSorter`	Alphabetical by creator
`downloads`	Total downloads
`week`	Downloads this week
`month`	Downloads this month
`num_reviews`	Number of reviews
`num_favorites`	Number of favorites
`item_size`	Total item size
`files_count`	Number of files

Use asc or desc suffix:

ia search 'mediatype:audio' --sort='downloads desc'
ia search 'collection:books' --sort='publicdate asc'
ia search 'creator:NASA' --sort='title asc'

Search Query Syntax

The Internet Archive uses Apache Lucene query syntax. By default, the operator is AND (all terms must be present).

Query Operators

Operator	Description
`AND`	All terms must be present (default)
`OR`	Any of the terms can be present
`NOT`	Exclude documents with term (requires at least one positive term)
`( )`	Group clauses to form subqueries

Field-Specific Searches

Use field:value syntax to search specific metadata fields:

Query	Description
`'title:"search text"'`	By title
`'creator:"Author Name"'`	By creator/author
`'subject:"topic"'`	Search by subject
`'description:"text"'`	By description
`'collection:name'`	Items in a collection
`'mediatype:texts'`	By media type (texts, movies, audio, software, image, data)
`'contributor:smithsonian'`	By contributor
`'language:eng'`	By language code
`'format:pdf'`	Items containing specific file format
`'isbn:9780123456789'`	By ISBN
`'licenseurl:http*by-nc*'`	By Creative Commons license

Range Queries

Search values between bounds using square brackets (inclusive) or curly braces (exclusive):

Syntax	Description
`[1000 TO 2000]`	Inclusive range (includes bounds)
`{1000 TO 2000}`	Exclusive range (excludes bounds)
`[1000 TO null]`	Open-ended range (1000 or greater)
`[null TO 2000]`	Open-ended range (2000 or less)

Date Fields

Searchable date fields: addeddate, createdate, date, indexdate, publicdate, reviewdate, updatedate, oai_updatedate

Query	Description
`'date:[2020-01-01 TO 2024-12-31]'`	Date range
`'publicdate:[2024-01-01 TO 2024-06-30]'`	By publication date
`'indexdate:[2024-01-01T00:00:00Z TO 2024-12-31T23:59:59Z]'`	With timestamp
`'date:2024*'`	Wildcard for year (non-range)
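
Range clauses for numeric and date fields follow one pattern, so they can be generated. A minimal sketch, with a hypothetical function name, that maps a missing bound to the open-ended null shown in the range table:

```python
from datetime import date


def range_clause(field, low=None, high=None, inclusive=True):
    """Build `field:[low TO high]`; a missing bound becomes `null`."""
    if low is None and high is None:
        raise ValueError("at least one bound is required")
    opener, closer = ("[", "]") if inclusive else ("{", "}")
    low_text = "null" if low is None else str(low)
    high_text = "null" if high is None else str(high)
    return f"{field}:{opener}{low_text} TO {high_text}{closer}"


# str() on a date yields the YYYY-MM-DD form used in the date examples.
print(range_clause("date", date(2020, 1, 1), date(2024, 12, 31)))
print(range_clause("downloads", low=1000))
```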

Fuzzy Queries

Append ~ for approximate spelling matches:

ia search 'title:buttonwood~'

# Boost fuzzy matches with weights
ia search '(title:buttonwood~)^150 OR (subject:buttonwood~)^100'

Searching for Missing Fields

Find items where a field doesn't exist:

ia search 'collection:microfiche AND NOT _exists_:creator'

Searching by Uploader

Search by uploader's user item, screen name, or email:

ia search '_uploader_useritem:@username'
ia search '_uploader_screenname:"Display Name"'
ia search 'uploader:your@email.com'

Additional Searchable Fields

Beyond standard metadata, you can search by:

downloads - download count
item_size - total item size in bytes
files_count - number of files
collection_size - size of collection
item_count - items in collection

ia search 'collection:opensource AND downloads:[1000 TO null]'
ia search 'mediatype:movies AND item_size:[1000000000 TO null]'

Combined Queries

# AND is implicit between terms
ia search 'collection:nasa mediatype:image'

# Explicit operators
ia search 'collection:nasa AND mediatype:image'
ia search 'mediatype:texts OR mediatype:audio'
ia search 'collection:opensource NOT mediatype:software'

# Grouped subqueries
ia search '(mediatype:texts OR mediatype:audio) AND creator:"Mark Twain"'
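
The grouping rules above can be captured in a few small helpers so that queries are composed rather than hand-typed. This is a sketch with hypothetical names; it quotes values that contain whitespace and parenthesizes OR groups, which is enough to reproduce the last example.

```python
def term(field, value):
    """Return field:value, quoting the value when it contains whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{field}:{text}"


def any_of(*clauses):
    """Join clauses with OR inside parentheses so they form one subquery."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND; wrap the result in any_of() before nesting it."""
    return " AND ".join(clauses)


query = all_of(
    any_of(term("mediatype", "texts"), term("mediatype", "audio")),
    term("creator", "Mark Twain"),
)
print(query)  # (mediatype:texts OR mediatype:audio) AND creator:"Mark Twain"
```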

Full-Text Search

Use the -F (or --fts) flag to search within the actual text content of items rather than just metadata. This is particularly powerful for searching text collections like books, documents, and OCR'd materials.

Basic full-text search:

ia search -F 'collection:collection_name "search phrase"'

How it works:

Searches inside the full text of documents (OCR'd PDFs, text files, etc.)
More powerful than metadata-only search for finding specific quotes or passages
Requires items to have searchable text (OCR or text files)
Can be combined with collection and metadata filters

Full-text search syntax:

Use quotes for exact phrases: "complete phrase"
Combine with metadata filters: collection:name AND "text to find"
Works best with text collections that have been OCR'd

Examples

# Search NASA images
ia search 'collection:nasa mediatype:image' --parameters="rows=10"

# Search public domain books
ia search 'subject:"public domain" mediatype:texts'

# Get just identifiers
ia search 'creator:"Mark Twain"' --itemlist

# Full-text search within a text collection
ia search -F 'collection:books "climate change"'

# Full-text search for a specific quote in public domain texts
ia search -F '"to be or not to be" mediatype:texts'

# Full-text search with collection filter and pagination
ia search -F 'collection:usgovernmentdocuments "artificial intelligence"' --parameters="rows=20"

Download Operations

Download files from an Internet Archive item:

ia download <identifier>

Download Parameters

Parameter	Description
`--glob="*.ext"`	Download only matching files (separate multiple patterns with `|`: `"*.mp4|*.webm"`)
`--exclude="pattern"`	Exclude files matching pattern
`--format="FORMAT"`	Download specific derivative format
`--source=SOURCE`	Filter by source: `original`, `derivative`, `metadata`
`--exclude-source=SOURCE`	Exclude by source type
`--destdir=path`	Download to specific directory
`--no-directories`	Flatten directory structure
`-s, --stdout`	Write file to stdout (for piping)
`--dry-run`	Show what would be downloaded
`--checksum`	Skip files that already exist with correct checksum
`--on-the-fly`	Download on-the-fly files (generated derivatives)
`--search="QUERY"`	Download from search results
`--itemlist=FILE`	Download items listed in file

Filtering by Source Type

Use --source and --exclude-source to filter by file origin:

# Download only original files (skip all derivatives)
ia download my-item --source=original

# Download originals and metadata, skip derivatives
ia download my-item --exclude-source=derivative

# Download only metadata files
ia download my-item --source=metadata

Examples

# Download all files from an item
ia download TripDown1905

# Download specific files by name
ia download TripDown1905 file1.mp4 file2.ogv

# Download only MP4 files
ia download TripDown1905 --glob="*.mp4"

# Download MP4s but exclude low-quality versions
ia download TripDown1905 --glob="*.mp4" --exclude="*512kb*"

# Download specific format
ia download TripDown1905 --format='512Kb MPEG4'

# Download to specific directory
ia download TripDown1905 --destdir=./downloads

# Download from search results
ia download --search 'collection:opensource_movies' --glob="*.mp4"

# Download items from a list file
ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt
ia download --itemlist itemlist.txt

# Preview what will be downloaded
ia download my_item --dry-run
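
To reason about how --glob and --exclude combine before touching the network, the selection can be approximated locally. This sketch assumes the patterns behave like shell-style wildcards as implemented by Python's fnmatch module, which may differ from ia in edge cases; --dry-run remains the authoritative preview. The file names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(names, glob=None, exclude=None):
    """Keep names matching `glob` (if given) and not matching `exclude`."""

    def matches(name, patterns):
        # Several patterns may be given in one string, separated by "|".
        return any(fnmatchcase(name, pattern) for pattern in patterns.split("|"))

    selected = []
    for name in names:
        if glob and not matches(name, glob):
            continue
        if exclude and matches(name, exclude):
            continue
        selected.append(name)
    return selected


# Hypothetical file names for illustration.
names = ["film.mp4", "film_512kb.mp4", "film.ogv"]
print(select_files(names, glob="*.mp4", exclude="*512kb*"))
```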

Upload Operations

Upload files to the Internet Archive (requires authentication):

ia upload <identifier> file1 file2 --metadata="mediatype:value"

Required Metadata

The mediatype field is required. Common values:

texts - Books, documents, PDFs
movies - Video files
audio - Music, podcasts, sound
software - Programs, games
image - Photos, graphics
data - Datasets, archives

Upload Parameters

Parameter	Description
`--metadata="key:value"`	Set metadata (repeatable)
`--header="key:value"`	Set HTTP header
`--checksum`	Skip files already uploaded
`-v, --verify`	Verify data wasn't corrupted after upload
`--no-derive`	Skip derivative processing
`--retries=N`	Number of retry attempts
`--remote-name=NAME`	Set remote filename (for stdin uploads)
`--keep-directories`	Preserve directory structure in remote filename
`-o, --open-after-upload`	Open item in browser after upload
`--file-metadata=FILE`	File-level metadata from JSONL file
`--spreadsheet=FILE`	Bulk upload from CSV spreadsheet

Common Metadata Fields

--metadata="title:My Document Title"
--metadata="creator:Author Name"
--metadata="description:A description of the content"
--metadata="subject:topic1;topic2"
--metadata="collection:community_texts"
--metadata="date:2024-01-15"
--metadata="language:eng"
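
When metadata lives in a script rather than on the command line, the --metadata options can be generated from a dictionary. A minimal sketch with hypothetical function names: list values are joined with ";" as in the subject example above, and a missing mediatype is rejected because that field is required.

```python
def metadata_args(metadata):
    """Turn a dict into repeated --metadata=key:value options."""
    args = []
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            value = ";".join(str(item) for item in value)
        args.append(f"--metadata={key}:{value}")
    return args


def upload_command(identifier, files, metadata):
    """Build an argument list for `ia upload`, suitable for subprocess.run."""
    if "mediatype" not in metadata:
        raise ValueError("mediatype is required")
    return ["ia", "upload", identifier, *files, *metadata_args(metadata)]


print(upload_command(
    "my-document-2024",
    ["document.pdf"],
    {"mediatype": "texts", "title": "My Document", "subject": ["topic1", "topic2"]},
))
```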

Examples

# Upload a PDF document
ia upload my-document-2024 document.pdf \
  --metadata="mediatype:texts" \
  --metadata="title:My Document" \
  --metadata="creator:John Doe"

# Upload multiple files
ia upload my-archive file1.pdf file2.pdf file3.pdf \
  --metadata="mediatype:texts" \
  --metadata="title:Document Collection"

# Upload with checksum verification and retries
ia upload my-item large-file.zip \
  --metadata="mediatype:data" \
  --checksum \
  --retries=10

# Upload from stdin
cat data.gz | ia upload my-item - \
  --remote-name=data.gz \
  --metadata="mediatype:data"

# Bulk upload using spreadsheet
ia upload --spreadsheet=metadata.csv

Notes:

Items receive data mediatype by default if not specified
Mediatype can only be changed after upload with admin support
Derivative generation takes seconds to days depending on file type and system load
Items typically appear in search within minutes, but can take up to 24 hours

Test Collection

Upload to test_collection for validation - items are automatically removed after ~30 days:

ia upload my-test-item file.pdf \
  --metadata="mediatype:texts" \
  --metadata="collection:test_collection"

Identifier Guidelines

Use lowercase letters, numbers, and hyphens
No spaces or special characters
Keep it descriptive but concise
Check if identifier exists: ia metadata <identifier> --exists

Item Thumbnail Image

To set a custom thumbnail for an item, upload an image named <identifier>_itemimage.jpg:

ia upload my-item my-item_itemimage.jpg

Restricting Downloads

To make files streamable but not downloadable, add the item to the stream_only collection:

ia metadata <identifier> --append-list="collection:stream_only"

Metadata Operations

View and modify item metadata:

# View metadata (JSON output)
ia metadata <identifier>

# Extract specific field with jq
ia metadata <identifier> | jq '.metadata.date'

# List file formats contained in an item
ia metadata <identifier> --formats

# Modify metadata (set or replace)
ia metadata <identifier> --modify="title:New Title"
ia metadata <identifier> --modify="foo:bar" --modify="baz:value"

# Remove a metadata field
ia metadata <identifier> --modify="fieldname:REMOVE_TAG"

# Append value to existing field
ia metadata <identifier> --append="title:Subtitle Here"

# Append to list field (e.g., subjects)
ia metadata <identifier> --append-list="subject:new topic"

# Remove specific value from list field
ia metadata <identifier> --remove="subject:old topic"

# Modify file-level metadata
ia metadata <identifier> --target="files/foo.txt" --modify="title:My File"

# Bulk updates from spreadsheet
ia metadata --spreadsheet=metadata.csv

List Operations

List files in an Internet Archive item:

ia list <identifier>

Prints the files in the item. Use --columns or --all to include details such as size and format.

List Parameters

Parameter	Description
`--columns=name,size`	Specify columns to show
`--glob="*.pdf"`	Filter by pattern
`-l, --location`	Print full URLs for each file
`-a, --all`	List all available file information
`-v, --verbose`	Print column headers

# List with full URLs
ia list my-item --location

# List all file info with headers
ia list my-item --all --verbose

# List specific columns
ia list my-item --columns=name,size,format

Tasks and Jobs

Check status of catalog tasks (uploads, derives, etc.):

# Check tasks for a specific item
ia tasks <identifier>

# Check all your tasks
ia tasks

Darking and Undarking Items

To make an item dark (hidden from public access) or undark it:

# Dark an item (requires comment)
ia tasks <identifier> --cmd=make_dark.php --comment="Reason for darking"

# Undark an item
ia tasks <identifier> --cmd=make_undark.php --comment="Reason for undarking"

Bulk Operations with GNU Parallel

For batch processing many items, use GNU Parallel to run ia commands concurrently.

Installation

# macOS
brew install parallel

# Debian/Ubuntu
apt install parallel

Basic Usage

Pipe item identifiers to parallel, using {} as placeholder:

# Fetch metadata for many items
cat itemlist.txt | parallel 'ia metadata {}'

# Download multiple items
cat itemlist.txt | parallel 'ia download {}'

Careful Batch Processing

For reliable bulk operations, use job logging to track progress and handle failures:

# Step 1: Create item list
ia search 'collection:myproject' --itemlist > itemlist.txt

# Step 2: Run with job logging
cat itemlist.txt | parallel --joblog job.log 'ia download {}'

# Step 3: Check for failures
echo $?  # 0 = all succeeded

# Step 4: Retry only failed jobs
parallel --retry-failed --joblog job.log

Job Log Benefits

The --joblog file tracks each command's exit status, allowing you to:

Resume interrupted batch jobs
Retry only failed items without re-processing successes
Audit what succeeded and failed
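
For auditing, the job log can also be read directly. The sketch below assumes GNU Parallel's tab-separated joblog layout with the header Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command; the sample log and the function name are hypothetical.

```python
import csv
import io


def failed_commands(joblog_text):
    """Return the commands whose exit value or signal is non-zero."""
    reader = csv.DictReader(
        io.StringIO(joblog_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row["Command"]
        for row in reader
        if row["Exitval"].strip() != "0" or row["Signal"].strip() != "0"
    ]


# Hypothetical two-job log for illustration.
sample_log = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1700000000.000\t2.500\t0\t0\t0\t0\tia download item-one\n"
    "2\t:\t1700000003.000\t0.800\t0\t0\t1\t0\tia download item-two\n"
)
print(failed_commands(sample_log))
```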

Dry Run First

Always preview before bulk execution:

cat itemlist.txt | parallel --dry-run 'ia download {}'

Rate Limiting

Control concurrency to avoid overwhelming the server:

# Limit to 4 concurrent jobs
cat itemlist.txt | parallel -j4 'ia download {}'

# Add delay between jobs
cat itemlist.txt | parallel --delay 1 'ia download {}'

See: https://archive.org/developers/internetarchive/parallel.html

Best Practices

Always configure before uploading - Run ia configure first
Use meaningful identifiers - Descriptive, lowercase, hyphenated
Include proper metadata - At minimum: mediatype, title, creator
Check before uploading - Verify the identifier doesn't already exist: ia metadata <id> --exists
Use checksums - Add --checksum for large uploads to enable resume
Respect rate limits - Don't spam requests; add delays for bulk operations
Test with dry-run - Use --dry-run to preview operations
Use test_collection first - Validate uploads before committing to permanent collections
Zip large file sets - Bundle many small files into archives before uploading
Specify language - Set language metadata for proper OCR processing on texts

Error Handling

Error	Solution
"not configured"	Run `ia configure` or set environment variables
"identifier exists"	Choose a different identifier
"permission denied"	Check credentials at https://archive.org/account/s3.php
"network error"	Retry the operation; check internet connection
"item not found"	Verify the identifier spelling
"429 Too Many Requests"	Rate limited; wait for the interval given in the `Retry-After` header, then retry
Item not appearing in search	Usually appears within minutes; check `ia tasks <identifier>` for pending jobs
Derive task failed	Check filename characters, file format, language metadata
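
For the 429 case, a client has to turn the Retry-After header into a wait time. The header may carry either a number of seconds or an HTTP date, so a sketch of the conversion looks like this. The function name is hypothetical, and the 60-second fallback for a missing or unparseable header is an arbitrary assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=60.0):
    """Convert a Retry-After header value into seconds to wait."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():  # delay given directly in seconds
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # delay given as an HTTP date
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
```

A caller would sleep for the returned number of seconds before repeating the request, ideally with a cap on the total number of attempts.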

Quick Reference

# Search
ia search 'query'
ia search 'query' --itemlist

# Download
ia download <identifier>
ia download <identifier> --glob="*.pdf"

# Upload (requires auth)
ia upload <identifier> files --metadata="mediatype:texts"

# Metadata
ia metadata <identifier>
ia metadata <identifier> --modify="title:New Title"

# List files
ia list <identifier>

# Tasks
ia tasks <identifier>

# Config
ia configure
ia configure --whoami

# Install
uv tool install internetarchive

API Reference

For programmatic access beyond the CLI, see the full developer documentation: https://archive.org/developers

Core APIs

API	Description
Items	Understanding item structure and access
Metadata Schema	Complete metadata field reference
Metadata Read	Retrieve item metadata via API
Metadata Write	Modify item metadata via API
IAS3	S3-compatible API for uploads
Tasks	Task queue management

Additional APIs

API	Description
Changes	Track item modifications across the archive
Views	Access viewing and download statistics
Reviews	Manage item reviews
Simple Lists	Create item relationships and lists
OCR Service	Text recognition service
PDF Service	PDF generation and processing

Python Library

For Python integration: internetarchive library

TypeScript Library (Third-Party)

A community-maintained TypeScript port is available: internetarchive-ts (docs)

Note: This is a work in progress and not officially maintained by the Internet Archive.
