Internet Archive CLI Skill
This skill enables interaction with the Internet Archive (archive.org) using the ia command-line tool from the internetarchive Python package.
Items
An item is the fundamental unit on archive.org - a logical grouping of related files sharing common metadata. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Each item has a unique identifier across the entire archive.
Every item contains:
- Original uploaded files
- Derivative files (automatically generated by archive.org)
- <identifier>_meta.xml - item-level metadata
- <identifier>_files.xml - file-level metadata
Items must belong to a collection.
Item Limits
| Constraint | Recommended | Hard Limit |
|---|---|---|
| Item total size | Under 100GB | ~1TB |
| Files per item | Under 10,000 | 250,000 (performance degrades >10,000) |
| Single file size | Under 50GB | 500-700GB |
| Daily upload | Under 1,000 files | 5,000 files (zips count as 1) |
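The recommended limits in the table can be checked before an upload is attempted. The following is a minimal sketch, assuming decimal units (1 GB = 10^9 bytes), which the table does not specify. The function name check_item_plan and the constants are hypothetical and not part of the ia tool.

```python
# Recommended limits from the table above; decimal gigabytes are an assumption.
GB = 10**9
MAX_ITEM_BYTES = 100 * GB      # recommended total item size
MAX_FILES_PER_ITEM = 10_000    # recommended file count per item
MAX_FILE_BYTES = 50 * GB       # recommended single file size


def check_item_plan(file_sizes):
    """Return a list of warnings for a planned item, given file sizes in bytes."""
    warnings = []
    if sum(file_sizes) >= MAX_ITEM_BYTES:
        warnings.append("item total size is not under 100GB")
    if len(file_sizes) >= MAX_FILES_PER_ITEM:
        warnings.append("item does not have under 10,000 files")
    oversized = sum(1 for size in file_sizes if size >= MAX_FILE_BYTES)
    if oversized:
        warnings.append(f"{oversized} file(s) are not under 50GB")
    return warnings
```

An empty list means the plan stays within the recommended column; the hard limits are not checked here.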
Permanent URL patterns:
- Details page: https://archive.org/details/<identifier>
- Download directory: https://archive.org/download/<identifier>
- Specific file: https://archive.org/download/<identifier>/<filename>
- Item history: https://archive.org/history/<identifier>
Warning: Never link to server-specific URLs like ia802304.us.archive.org - these break when items migrate between servers. Always use the canonical archive.org URLs above.
For more details, see: https://archive.org/developers/items.html
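The permanent URL patterns can be generated from an identifier rather than copied from a browser, which avoids picking up a server-specific host by accident. A small sketch follows; item_urls is a hypothetical helper, and percent-encoding the filename is an assumption made for filenames that contain spaces.

```python
from urllib.parse import quote

BASE = "https://archive.org"


def item_urls(identifier, filename=None):
    """Build the canonical archive.org URLs for an item."""
    urls = {
        "details": f"{BASE}/details/{identifier}",
        "download": f"{BASE}/download/{identifier}",
        "history": f"{BASE}/history/{identifier}",
    }
    if filename is not None:
        # Percent-encode the filename but keep "/" so nested paths survive.
        urls["file"] = f"{BASE}/download/{identifier}/{quote(filename)}"
    return urls
```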
Derivatives
When you upload files to the Internet Archive, the system automatically generates derivative files - converted versions in different formats and resolutions. For example:
- Video: Transcoded to h.264, Ogg, and various bitrates
- Audio: Converted to MP3 (multiple bitrates), Ogg Vorbis, FLAC
- Text/Books: OCR processing, searchable PDFs, EPUB, DjVu
- Images: Thumbnails, JPEG 2000, different resolutions
Derivatives make content accessible across different devices and bandwidths. You can identify derivatives in ia list output - they have an original field pointing to their source file.
To skip derivative generation during upload, use --no-derive:
ia upload my-item file.mp4 --metadata="mediatype:movies" --no-derive
For the complete list of source formats and their generated derivatives, see: https://archive.org/help/derivatives.php
Metadata Schema
Internet Archive items use XML-based metadata. Key points:
- Required fields: identifier, mediatype
- Recommended fields: title, description, creator, date, subject, collection, language
- Repeatable fields: collections, creators, subjects, languages support multiple values
- Custom fields: You can define unlimited custom metadata fields (must follow XML naming rules)
Identifier requirements:
- ASCII alphanumeric, underscores, dashes, or periods only
- Must begin with alphanumeric character
- 5-100 characters (5-80 recommended)
- Unique and unchangeable once set
For the complete metadata schema reference, see: https://archive.org/developers/metadata-schema
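The syntactic identifier rules above fit in a single regular expression. This is a sketch of a local pre-check only: is_valid_identifier is a hypothetical helper, and it cannot tell whether an identifier is already taken.

```python
import re

# Starts with an ASCII alphanumeric, then 4-99 more characters drawn from
# alphanumerics, underscore, dash, or period: 5-100 characters in total.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{4,99}")


def is_valid_identifier(identifier):
    """Check the syntactic rules only; uniqueness needs a lookup on archive.org."""
    return IDENTIFIER_RE.fullmatch(identifier) is not None
```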
Collections
Collections group related items together. Key points:
- Only IA staff can create collections - users must request creation
- Minimum 50 items required for a new collection
- Items must be related and typically same media type
- Collection creation takes up to two weeks after request
To request a collection, contact Internet Archive with:
- List of item identifiers or search query identifying items
- Desired collection identifier (5-80 chars, alphanumeric only)
- Collection title and description
- At least one subject tag
Public upload collections (anyone can upload to):
- opensource_movies, opensource_audio, opensource_media - general media
- community_texts, community_video, community_audio - community contributions
Other collections restrict uploads to designated uploaders only.
Tool Detection and Installation
Before using any ia commands, check if the tool is installed:
ia --version
If the ia command is not found, install it using uv:
uv tool install internetarchive
Alternative installation methods:
pipx install internetarchive
pip install internetarchive
After installation, verify it works with ia --version.
Global Options
These options work with all ia commands:
| Option | Description |
|---|---|
| -h, --help | Show help message |
| -v, --version | Display version |
| -c FILE, --config-file | Path to config file |
| -l, --log | Enable logging |
| -d, --debug | Enable debug output |
Configuration and Authentication
Check if ia is configured:
ia configure --whoami
If not configured (shows error or empty), the user needs to set up credentials:
- Interactive setup: Run ia configure and follow prompts
- Get credentials: IA-S3 keys from https://archive.org/account/s3.php
- Config location: Saves to ~/.config/ia.ini
Configure Options
| Option | Description |
|---|---|
| --whoami | Print current authenticated user |
| --show | Print current config as JSON |
| --check | Validate IA-S3 keys (exit 0 if valid, 1 otherwise) |
# Show current config
ia configure --show
# Validate keys (useful in scripts)
ia configure --check && echo "Keys valid"
Environment Variables
Alternative to config file:
export IA_ACCESS_KEY_ID="your-access-key"
export IA_SECRET_ACCESS_KEY="your-secret-key"
Note: Configuration is required for uploads and metadata modifications. Searching and downloading public items works without authentication.
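A script can check that both variables are present before it attempts an upload. The sketch below only tests for presence and does not validate the keys (ia configure --check does that); credentials_from_env is a hypothetical helper name.

```python
import os


def credentials_from_env(env=None):
    """Return (access_key, secret_key) from the environment, or None if either is unset."""
    env = os.environ if env is None else env
    access = env.get("IA_ACCESS_KEY_ID")
    secret = env.get("IA_SECRET_ACCESS_KEY")
    if access and secret:
        return access, secret
    return None
```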
User-Agent Identification (Required)
All requests to the Internet Archive must include a proper User-Agent string that clearly identifies the source of the request. This applies to every request made via any tool - the ia CLI, Python library, direct API calls, curl, or any other HTTP client. This is critical for AI agents, bots, and automated tools.
The ia CLI automatically includes a default User-Agent with your access key:
internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0
When using Claude Code or other AI/LLM agents, you must append a custom suffix that includes:
- The tool/agent name and version (e.g., "Claude Code/1.0.0")
- The model being used if applicable (e.g., "claude-sonnet-4-20250514")
- Any relevant context about the automation
The --user-agent-suffix CLI option and user_agent_suffix config setting require internetarchive version 5.7.2 or newer. The default User-Agent (including access key) is always sent - your suffix is appended to it.
CLI:
ia --user-agent-suffix "Claude Code/1.0.0 (claude-sonnet-4-20250514)" download my-item
INI file (~/.config/internetarchive/ia.ini):
[general]
user_agent_suffix = Claude Code/1.0.0 (claude-sonnet-4-20250514)
Python API:
from internetarchive import get_session
session = get_session(config={
'general': {'user_agent_suffix': 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'}
})
The resulting User-Agent will look like:
internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0 Claude Code/1.0.0 (claude-sonnet-4-20250514)
This helps the Internet Archive track usage patterns, troubleshoot issues, and maintain service quality. Always be specific - include version numbers, model identifiers, and enough detail to distinguish your tool from others.
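To keep the suffix consistent across the CLI, INI, and Python settings, it can be composed in one place. This is a sketch: build_user_agent_suffix is a hypothetical helper, and joining the model and context with "; " is an assumption, since the text above only shows the single-value form.

```python
def build_user_agent_suffix(tool, version, model=None, context=None):
    """Compose a suffix such as 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'."""
    suffix = f"{tool}/{version}"
    # Optional details go in one parenthesised group after the name/version.
    details = [part for part in (model, context) if part]
    if details:
        suffix += f" ({'; '.join(details)})"
    return suffix
```

The returned string can be passed as the user_agent_suffix value in the get_session config shown above.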
Search Operations
Search the Internet Archive catalog:
ia search '<query>'
Search Parameters
| Parameter | Description |
|---|---|
| --itemlist | Output identifiers only, one per line |
| -n, --num-found | Print only the count of results |
| -s, --sort | Sort results: --sort='field desc' or --sort='field asc' |
| -f, --field | Return specific metadata fields (repeatable) |
| -F, --fts | Full-text search (search within text content, not just metadata) |
| --parameters | Raw query parameters: --parameters="page=N&rows=N" |
# Get result count only
ia search 'collection:nasa' -n
# Sort by date descending
ia search 'mediatype:texts' --sort='date desc'
# Return specific fields
ia search 'collection:nasa' --field=identifier --field=title
Sort Fields
Common sort fields for use with --sort:
| Field | Description |
|---|---|
| date | Content date |
| publicdate | When item was published to archive.org |
| addeddate | When added to archive |
| updatedate | Last updated |
| title / titleSorter | Alphabetical by title |
| creator / creatorSorter | Alphabetical by creator |
| downloads | Total downloads |
| week | Downloads this week |
| month | Downloads this month |
| num_reviews | Number of reviews |
| num_favorites | Number of favorites |
| item_size | Total item size |
| files_count | Number of files |
Use asc or desc suffix:
ia search 'mediatype:audio' --sort='downloads desc'
ia search 'collection:books' --sort='publicdate asc'
ia search 'creator:NASA' --sort='title asc'
Search Query Syntax
The Internet Archive uses Apache Lucene query syntax. By default, the operator is AND (all terms must be present).
Query Operators
| Operator | Description |
|---|---|
| AND | All terms must be present (default) |
| OR | Any of the terms can be present |
| NOT | Exclude documents with term (requires at least one positive term) |
| ( ) | Group clauses to form subqueries |
Field-Specific Searches
Use field:value syntax to search specific metadata fields:
| Query | Description |
|---|---|
| 'title:"search text"' | By title |
| 'creator:"Author Name"' | By creator/author |
| 'subject:"topic"' | By subject |
| 'description:"text"' | By description |
| 'collection:name' | Items in a collection |
| 'mediatype:texts' | By media type (texts, movies, audio, software, image, data) |
| 'contributor:smithsonian' | By contributor |
| 'language:eng' | By language code |
| 'format:pdf' | Items containing specific file format |
| 'isbn:9780123456789' | By ISBN |
| 'licenseurl:http*by-nc*' | By Creative Commons license |
Range Queries
Search values between bounds using brackets or parentheses:
| Syntax | Description |
|---|---|
| [1000 TO 2000] | Inclusive range (includes bounds) |
| {1000 TO 2000} | Exclusive range (excludes bounds) |
| [1000 TO null] | Open-ended range (1000 or greater) |
| [null TO 2000] | Open-ended range (2000 or less) |
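The range forms in the table are easy to mistype when queries are assembled in a script. A minimal sketch of a builder follows; range_clause is a hypothetical helper that only formats strings and does not contact archive.org.

```python
def range_clause(field, low=None, high=None, inclusive=True):
    """Build a Lucene range clause; None on either side becomes the open bound 'null'."""
    lo = "null" if low is None else str(low)
    hi = "null" if high is None else str(high)
    # Square brackets include the bounds, curly braces exclude them.
    open_b, close_b = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_b}{lo} TO {hi}{close_b}"
```

The result is meant to be passed as part of the query string given to ia search.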
Date Fields
Searchable date fields: addeddate, createdate, date, indexdate, publicdate, reviewdate, updatedate, oai_updatedate
| Query | Description |
|---|---|
| 'date:[2020-01-01 TO 2024-12-31]' | Date range |
| 'publicdate:[2024-01-01 TO 2024-06-30]' | By publication date |
| 'indexdate:[2024-01-01T00:00:00Z TO 2024-12-31T23:59:59Z]' | With timestamp |
| 'date:2024*' | Wildcard for year (non-range) |
Fuzzy Queries
Append ~ for approximate spelling matches:
ia search 'title:buttonwood~'
# Boost fuzzy matches with weights
ia search '(title:buttonwood~)^150 OR (subject:buttonwood~)^100'
Searching for Missing Fields
Find items where a field doesn't exist:
ia search 'collection:microfiche AND NOT _exists_:creator'
Searching by Uploader
Search by uploader's user item, screen name, or email:
ia search '_uploader_useritem:@username'
ia search '_uploader_screenname:"Display Name"'
ia search 'uploader:your@email.com'
Additional Searchable Fields
Beyond standard metadata, you can search by:
- downloads - download count
- item_size - total item size in bytes
- files_count - number of files
- collection_size - size of collection
- item_count - items in collection
ia search 'collection:opensource AND downloads:[1000 TO null]'
ia search 'mediatype:movies AND item_size:[1000000000 TO null]'
Combined Queries
# AND is implicit between terms
ia search 'collection:nasa mediatype:image'
# Explicit operators
ia search 'collection:nasa AND mediatype:image'
ia search 'mediatype:texts OR mediatype:audio'
ia search 'collection:opensource NOT mediatype:software'
# Grouped subqueries
ia search '(mediatype:texts OR mediatype:audio) AND creator:"Mark Twain"'
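When queries are built programmatically, grouping and quoting are the usual sources of mistakes. The sketch below composes the grouped query from the last example; field_clause, any_of, and all_of are hypothetical helpers, and escaping embedded double quotes with a backslash is an assumption based on general Lucene syntax.

```python
def field_clause(field, value):
    """Return field:value, quoting the value when it contains whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{field}:{text}"


def any_of(*clauses):
    """Join clauses with OR inside parentheses so they form one subquery."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with an explicit AND."""
    return " AND ".join(clauses)


query = all_of(
    any_of(field_clause("mediatype", "texts"), field_clause("mediatype", "audio")),
    field_clause("creator", "Mark Twain"),
)
```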
Full-Text Search
Use the -F (or --fts) flag to search within the actual text content of items rather than just metadata. This is particularly powerful for searching text collections like books, documents, and OCR'd materials.
Basic full-text search:
ia search -F 'collection:collection_name "search phrase"'
How it works:
- Searches inside the full text of documents (OCR'd PDFs, text files, etc.)
- More powerful than metadata-only search for finding specific quotes or passages
- Requires items to have searchable text (OCR or text files)
- Can be combined with collection and metadata filters
Full-text search syntax:
- Use quotes for exact phrases: "complete phrase"
- Combine with metadata filters: collection:name AND "text to find"
- Works best with text collections that have been OCR'd
Examples
# Search NASA images
ia search 'collection:nasa mediatype:image' --parameters="rows=10"
# Search public domain books
ia search 'subject:"public domain" mediatype:texts'
# Get just identifiers
ia search 'creator:"Mark Twain"' --itemlist
# Full-text search within a text collection
ia search -F 'collection:books "climate change"'
# Full-text search for a specific quote in public domain texts
ia search -F '"to be or not to be" mediatype:texts'
# Full-text search with collection filter and pagination
ia search -F 'collection:usgovernmentdocuments "artificial intelligence"' --parameters="rows=20"
Download Operations
Download files from an Internet Archive item:
ia download <identifier>
Download Parameters
| Parameter | Description |
|---|---|
| --glob="*.ext" | Download only matching files (use \| for multiple: '*.mp4\|*.webm') |
| --exclude="*pattern*" | Exclude files matching pattern |
| --format="FORMAT" | Download specific derivative format |
| --source=SOURCE | Filter by source: original, derivative, metadata |
| --exclude-source=SOURCE | Exclude by source type |
| --destdir=path | Download to specific directory |
| --no-directories | Flatten directory structure |
| -s, --stdout | Write file to stdout (for piping) |
| --dry-run | Show what would be downloaded |
| --checksum | Skip files that already exist with correct checksum |
| --on-the-fly | Download on-the-fly files (generated derivatives) |
| --search="QUERY" | Download from search results |
| --itemlist=FILE | Download items listed in file |
Filtering by Source Type
Use --source and --exclude-source to filter by file origin:
# Download only original files (skip all derivatives)
ia download my-item --source=original
# Download originals and metadata, skip derivatives
ia download my-item --exclude-source=derivative
# Download only metadata files
ia download my-item --source=metadata
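To see how an item's files are split across source types before choosing a filter, the file entries can be grouped locally. This sketch assumes each entry is a dict with name and source keys, as in the files array of the JSON printed by ia metadata <identifier>; that shape is an assumption here, and group_by_source is a hypothetical helper.

```python
from collections import defaultdict


def group_by_source(files):
    """Group file names by their 'source' value (original, derivative, metadata)."""
    groups = defaultdict(list)
    for entry in files:
        # Entries without a 'source' key are collected under 'unknown'.
        groups[entry.get("source", "unknown")].append(entry["name"])
    return dict(groups)
```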
Examples
# Download all files from an item
ia download TripDown1905
# Download specific files by name
ia download TripDown1905 file1.mp4 file2.ogv
# Download only MP4 files
ia download TripDown1905 --glob="*.mp4"
# Download MP4s but exclude low-quality versions
ia download TripDown1905 --glob="*.mp4" --exclude="*512kb*"
# Download specific format
ia download TripDown1905 --format='512Kb MPEG4'
# Download to specific directory
ia download TripDown1905 --destdir=./downloads
# Download from search results
ia download --search 'collection:opensource_movies' --glob="*.mp4"
# Download items from a list file
ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt
ia download --itemlist itemlist.txt
# Preview what will be downloaded
ia download my_item --dry-run
Upload Operations
Upload files to the Internet Archive (requires authentication):
ia upload <identifier> file1 file2 --metadata="mediatype:value"
Required Metadata
The mediatype field is required. Common values:
- texts - Books, documents, PDFs
- movies - Video files
- audio - Music, podcasts, sound
- software - Programs, games
- image - Photos, graphics
- data - Datasets, archives
Upload Parameters
| Parameter | Description |
|---|---|
| --metadata="key:value" | Set metadata (repeatable) |
| --header="key:value" | Set HTTP header |
| --checksum | Skip files already uploaded |
| -v, --verify | Verify data wasn't corrupted after upload |
| --no-derive | Skip derivative processing |
| --retries=N | Number of retry attempts |
| --remote-name=NAME | Set remote filename (for stdin uploads) |
| --keep-directories | Preserve directory structure in remote filename |
| -o, --open-after-upload | Open item in browser after upload |
| --file-metadata=FILE | File-level metadata from JSONL file |
| --spreadsheet=FILE | Bulk upload from CSV spreadsheet |
Common Metadata Fields
--metadata="title:My Document Title"
--metadata="creator:Author Name"
--metadata="description:A description of the content"
--metadata="subject:topic1;topic2"
--metadata="collection:community_texts"
--metadata="date:2024-01-15"
--metadata="language:eng"
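When uploads are driven from a script, building the argument list in code avoids shell-quoting problems with titles and descriptions that contain spaces. A minimal sketch, using only the ia upload form shown in this section; upload_command is a hypothetical helper that builds the command without running it.

```python
def upload_command(identifier, files, metadata):
    """Build the argument list for 'ia upload' without running it."""
    args = ["ia", "upload", identifier, *files]
    for key, value in metadata.items():
        # One --metadata flag per field, in key:value form.
        args.append(f"--metadata={key}:{value}")
    return args
```

The list can be handed to subprocess.run(args, check=True); because it is a list rather than a shell string, no extra quoting is needed.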
Examples
# Upload a PDF document
ia upload my-document-2024 document.pdf \
--metadata="mediatype:texts" \
--metadata="title:My Document" \
--metadata="creator:John Doe"
# Upload multiple files
ia upload my-archive file1.pdf file2.pdf file3.pdf \
--metadata="mediatype:texts" \
--metadata="title:Document Collection"
# Upload with retries, skipping files already uploaded (matched by checksum)
ia upload my-item large-file.zip \
--metadata="mediatype:data" \
--checksum \
--retries=10
# Upload from stdin
cat data.gz | ia upload my-item - \
--remote-name=data.gz \
--metadata="mediatype:data"
# Bulk upload using spreadsheet
ia upload --spreadsheet=metadata.csv
Notes:
- Items receive the data mediatype by default if not specified
- Mediatype can only be changed after upload with admin support
- Derivative generation takes seconds to days depending on file type and system load
- Items typically appear in search within minutes, but can take up to 24 hours
Test Collection
Upload to test_collection for validation - items are automatically removed after ~30 days:
ia upload my-test-item file.pdf \
--metadata="mediatype:texts" \
--metadata="collection:test_collection"
Identifier Guidelines
- Use lowercase letters, numbers, and hyphens
- No spaces or special characters
- Keep it descriptive but concise
- Check if identifier exists:
ia metadata <identifier> --exists
Item Thumbnail Image
To set a custom thumbnail for an item, upload an image named <identifier>_itemimage.jpg:
ia upload my-item my-item_itemimage.jpg
Restricting Downloads
To make files streamable but not downloadable, add the item to the stream_only collection:
ia metadata <identifier> --append-list="collection:stream_only"
Metadata Operations
View and modify item metadata:
# View metadata (JSON output)
ia metadata <identifier>
# Extract specific field with jq
ia metadata <identifier> | jq '.metadata.date'
# List file formats contained in an item
ia metadata <identifier> --formats
# Modify metadata (set or replace)
ia metadata <identifier> --modify="title:New Title"
ia metadata <identifier> --modify="foo:bar" --modify="baz:value"
# Remove a metadata field
ia metadata <identifier> --modify="fieldname:REMOVE_TAG"
# Append value to existing field
ia metadata <identifier> --append="title:Subtitle Here"
# Append to list field (e.g., subjects)
ia metadata <identifier> --append-list="subject:new topic"
# Remove specific value from list field
ia metadata <identifier> --remove="subject:old topic"
# Modify file-level metadata
ia metadata <identifier> --target="files/foo.txt" --modify="title:My File"
# Bulk updates from spreadsheet
ia metadata --spreadsheet=metadata.csv
List Operations
List files in an Internet Archive item:
ia list <identifier>
Shows all files with details (name, size, format).
List Parameters
| Parameter | Description |
|---|---|
| --columns=name,size | Specify columns to show |
| --glob="*.pdf" | Filter by pattern |
| -l, --location | Print full URLs for each file |
| -a, --all | List all available file information |
| -v, --verbose | Print column headers |
# List with full URLs
ia list my-item --location
# List all file info with headers
ia list my-item --all --verbose
# List specific columns
ia list my-item --columns=name,size,format
Tasks and Jobs
Check status of catalog tasks (uploads, derives, etc.):
# Check tasks for a specific item
ia tasks <identifier>
# Check all your tasks
ia tasks
Darking and Undarking Items
To make an item dark (hidden from public access) or undark it:
# Dark an item (requires comment)
ia tasks <identifier> --cmd=make_dark.php --comment="Reason for darking"
# Undark an item
ia tasks <identifier> --cmd=make_undark.php --comment="Reason for undarking"
Bulk Operations with GNU Parallel
For batch processing many items, use GNU Parallel to run ia commands concurrently.
Installation
# macOS
brew install parallel
# Debian/Ubuntu
apt install parallel
Basic Usage
Pipe item identifiers to parallel, using {} as placeholder:
# Fetch metadata for many items
cat itemlist.txt | parallel 'ia metadata {}'
# Download multiple items
cat itemlist.txt | parallel 'ia download {}'
Careful Batch Processing
For reliable bulk operations, use job logging to track progress and handle failures:
# Step 1: Create item list
ia search 'collection:myproject' --itemlist > itemlist.txt
# Step 2: Run with job logging
cat itemlist.txt | parallel --joblog job.log 'ia download {}'
# Step 3: Check for failures
echo $? # 0 = all succeeded
# Step 4: Retry only failed jobs
parallel --retry-failed --joblog job.log
Job Log Benefits
The --joblog file tracks each command's exit status, allowing you to:
- Resume interrupted batch jobs
- Retry only failed items without re-processing successes
- Audit what succeeded and failed
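For auditing, the job log can also be read directly. The sketch below assumes the standard GNU Parallel joblog layout: a tab-separated file whose header row includes Exitval, Signal, and Command columns. failed_commands is a hypothetical helper.

```python
import csv
import io


def failed_commands(joblog_text):
    """Return the commands whose Exitval or Signal column is non-zero."""
    reader = csv.DictReader(
        io.StringIO(joblog_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row["Command"]
        for row in reader
        if row["Exitval"] != "0" or row["Signal"] != "0"
    ]
```

Given the contents of job.log as a string, this returns the exact command lines that parallel --retry-failed would run again.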
Dry Run First
Always preview before bulk execution:
cat itemlist.txt | parallel --dry-run 'ia download {}'
Rate Limiting
Control concurrency to avoid overwhelming the server:
# Limit to 4 concurrent jobs
cat itemlist.txt | parallel -j4 'ia download {}'
# Add delay between jobs
cat itemlist.txt | parallel --delay 1 'ia download {}'
See: https://archive.org/developers/internetarchive/parallel.html
Best Practices
- Always configure before uploading - Run ia configure first
- Use meaningful identifiers - Descriptive, lowercase, hyphenated
- Include proper metadata - At minimum: mediatype, title, creator
- Check before uploading - Verify identifier doesn't exist: ia metadata <id>
- Use checksums - Add --checksum for large uploads to enable resume
- Respect rate limits - Don't spam requests; add delays for bulk operations
- Test with dry-run - Use --dry-run to preview operations
- Use test_collection first - Validate uploads before committing to permanent collections
- Zip large file sets - Bundle many small files into archives before uploading
- Specify language - Set language metadata for proper OCR processing on texts
Error Handling
| Error | Solution |
|---|---|
| "not configured" | Run ia configure or set environment variables |
| "identifier exists" | Choose a different identifier |
| "permission denied" | Check credentials at https://archive.org/account/s3.php |
| "network error" | Retry the operation; check internet connection |
| "item not found" | Verify the identifier spelling |
| "429 Too Many Requests" | Rate limited; wait and retry with Retry-After header value |
| Item not appearing in search | Usually appears within minutes; check ia tasks <identifier> for pending jobs |
| Derive task failed | Check filename characters, file format, language metadata |
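For the 429 case, the Retry-After header may carry either a number of seconds or an HTTP date. The sketch below converts either form into a wait time; retry_after_seconds is a hypothetical helper, and the 60-second fallback for a missing or unparseable header is an assumption, not a documented archive.org value.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=60.0, now=None):
    """Convert a Retry-After header (delta-seconds or HTTP-date) into a wait in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: fall back to the default wait.
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further wait is needed.
    return max(0.0, (when - now).total_seconds())
```

A caller would sleep for the returned number of seconds before retrying the request.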
Quick Reference
# Search
ia search 'query'
ia search 'query' --itemlist
# Download
ia download <identifier>
ia download <identifier> --glob="*.pdf"
# Upload (requires auth)
ia upload <identifier> files --metadata="mediatype:texts"
# Metadata
ia metadata <identifier>
ia metadata <identifier> --modify="title:New Title"
# List files
ia list <identifier>
# Tasks
ia tasks <identifier>
# Config
ia configure
ia configure --whoami
# Install
uv tool install internetarchive
API Reference
For programmatic access beyond the CLI, see the full developer documentation: https://archive.org/developers
Core APIs
| API | Description |
|---|---|
| Items | Understanding item structure and access |
| Metadata Schema | Complete metadata field reference |
| Metadata Read | Retrieve item metadata via API |
| Metadata Write | Modify item metadata via API |
| IAS3 | S3-compatible API for uploads |
| Tasks | Task queue management |
Additional APIs
| API | Description |
|---|---|
| Changes | Track item modifications across the archive |
| Views | Access viewing and download statistics |
| Reviews | Manage item reviews |
| Simple Lists | Create item relationships and lists |
| OCR Service | Text recognition service |
| PDF Service | PDF generation and processing |
Python Library
For Python integration: internetarchive library
TypeScript Library (Third-Party)
A community-maintained TypeScript port is available: internetarchive-ts (docs)
Note: This is a work in progress and not officially maintained by the Internet Archive.
