
Interact with Internet Archive (archive.org) - upload files, download items, and search the archive using the ia CLI tool. Use when working with archive.org, archiving content, or retrieving historical data.

Updated 2/15/2026


Internet Archive CLI Skill

This skill enables interaction with the Internet Archive (archive.org) using the ia command-line tool from the internetarchive Python package.

Items

An item is the fundamental unit on archive.org - a logical grouping of related files sharing common metadata. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Each item has a unique identifier across the entire archive.

Every item contains:

  • Original uploaded files
  • Derivative files (automatically generated by archive.org)
  • <identifier>_meta.xml - item-level metadata
  • <identifier>_files.xml - file-level metadata

Items must belong to a collection.

Item Limits

Constraint        Recommended        Hard Limit
Item total size   Under 100GB        ~1TB
Files per item    Under 10,000       250,000 (performance degrades >10,000)
Single file size  Under 50GB         500-700GB
Daily upload      Under 1,000 files  5,000 files (zips count as 1)
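
As a rough illustration, the recommended limits above can be checked before an upload is planned. This is a minimal sketch, not part of the ia tool: the function name check_item_plan is hypothetical, and it assumes decimal gigabytes (10^9 bytes), which the table does not specify.

```python
from typing import Dict, List

GB = 10**9  # assumption: decimal gigabytes; the table does not say which unit it means

# Recommended ceilings copied from the "Recommended" column above.
MAX_ITEM_BYTES = 100 * GB
MAX_FILES_PER_ITEM = 10_000
MAX_FILE_BYTES = 50 * GB


def check_item_plan(file_sizes: Dict[str, int]) -> List[str]:
    """Return warnings for a planned item; an empty list means it fits the recommendations."""
    warnings = []
    total = sum(file_sizes.values())
    if total >= MAX_ITEM_BYTES:
        warnings.append(f"item total of {total} bytes is not under 100GB")
    if len(file_sizes) >= MAX_FILES_PER_ITEM:
        warnings.append(f"{len(file_sizes)} files is not under 10,000")
    for name, size in sorted(file_sizes.items()):
        if size >= MAX_FILE_BYTES:
            warnings.append(f"{name} ({size} bytes) is not under 50GB")
    return warnings
```

An empty result only means the plan stays inside the recommended column; the hard limits and the daily upload limit are not checked here.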

Permanent URL patterns:

  • Details page: https://archive.org/details/<identifier>
  • Download directory: https://archive.org/download/<identifier>
  • Specific file: https://archive.org/download/<identifier>/<filename>
  • Item history: https://archive.org/history/<identifier>

Warning: Never link to server-specific URLs like ia802304.us.archive.org - these break when items migrate between servers. Always use the canonical archive.org URLs above.

For more details, see: https://archive.org/developers/items.html
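
The permanent URL patterns above can be generated from an identifier instead of being typed by hand. The following is a small sketch with hypothetical helper names; it only does string formatting and percent-encoding, and makes no request to archive.org.

```python
from typing import Optional
from urllib.parse import quote

BASE = "https://archive.org"


def details_url(identifier: str) -> str:
    """Canonical details page for an item."""
    return f"{BASE}/details/{quote(identifier, safe='')}"


def download_url(identifier: str, filename: Optional[str] = None) -> str:
    """Download directory for an item, or a specific file inside it."""
    url = f"{BASE}/download/{quote(identifier, safe='')}"
    if filename is not None:
        # quote() keeps "/" by default, so subdirectory paths stay intact
        url += "/" + quote(filename)
    return url


def history_url(identifier: str) -> str:
    """Item history page."""
    return f"{BASE}/history/{quote(identifier, safe='')}"
```

Building links this way keeps them on the canonical archive.org host, which is what the warning above asks for.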

Derivatives

When you upload files to the Internet Archive, the system automatically generates derivative files - converted versions in different formats and resolutions. For example:

  • Video: Transcoded to H.264 and Ogg at various bitrates
  • Audio: Converted to MP3 (multiple bitrates), Ogg Vorbis, FLAC
  • Text/Books: OCR processing, searchable PDFs, EPUB, DjVu
  • Images: Thumbnails, JPEG 2000, different resolutions

Derivatives make content accessible across different devices and bandwidths. You can identify derivatives in ia list output - they have an original field pointing to their source file.

To skip derivative generation during upload, use --no-derive:

ia upload my-item file.mp4 --metadata="mediatype:movies" --no-derive

For the complete list of source formats and their generated derivatives, see: https://archive.org/help/derivatives.php
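
Since derivatives carry an original field naming their source file, the file list of an item can be grouped by source. The sketch below assumes the file entries have already been loaded as dictionaries (for example from the JSON printed by ia metadata); the function name and the sample file names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_derivatives(files: List[dict]) -> Dict[str, List[str]]:
    """Map each source file name to the names of the derivatives generated from it."""
    groups = defaultdict(list)
    for entry in files:
        parent = entry.get("original")  # only derivatives have this field
        if parent:
            groups[parent].append(entry["name"])
    return dict(groups)


# Hypothetical file entries, shaped like the "files" list of an item.
sample_files = [
    {"name": "film.mpeg", "source": "original"},
    {"name": "film.mp4", "source": "derivative", "original": "film.mpeg"},
    {"name": "film.ogv", "source": "derivative", "original": "film.mpeg"},
    {"name": "film_meta.xml", "source": "metadata"},
]
```

Files without an original field (uploads and metadata files) are simply left out of the result.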

Metadata Schema

Internet Archive items use XML-based metadata. Key points:

  • Required fields: identifier, mediatype
  • Recommended fields: title, description, creator, date, subject, collection, language
  • Repeatable fields: collection, creator, subject, and language accept multiple values
  • Custom fields: You can define unlimited custom metadata fields (must follow XML naming rules)

Identifier requirements:

  • ASCII alphanumeric, underscores, dashes, or periods only
  • Must begin with alphanumeric character
  • 5-100 characters (5-80 recommended)
  • Unique and unchangeable once set
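
The syntactic rules above can be checked locally before an upload is attempted. This is a minimal sketch with a hypothetical function name; it validates format and length only and cannot tell whether an identifier is already taken.

```python
import re

# First character alphanumeric, then 4-99 more characters from the allowed set,
# giving a total length of 5-100. The explicit ranges keep the match ASCII-only.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{4,99}")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the identifier satisfies the format rules listed above."""
    return _IDENTIFIER_RE.fullmatch(identifier) is not None
```

Uniqueness still has to be checked against archive.org itself.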

For the complete metadata schema reference, see: https://archive.org/developers/metadata-schema

Collections

Collections group related items together. Key points:

  • Only IA staff can create collections - users must request creation
  • Minimum 50 items required for a new collection
  • Items must be related and typically same media type
  • Collection creation takes up to two weeks after request

To request a collection, contact Internet Archive with:

  • List of item identifiers or search query identifying items
  • Desired collection identifier (5-80 chars, alphanumeric only)
  • Collection title and description
  • At least one subject tag

Public upload collections (anyone can upload to):

  • opensource_movies, opensource_audio, opensource_media - general media
  • community_texts, community_video, community_audio - community contributions

Other collections restrict uploads to designated uploaders only.

Tool Detection and Installation

Before using any ia commands, check if the tool is installed:

ia --version

If the ia command is not found, install it using uv:

uv tool install internetarchive

Alternative installation methods:

  • pipx install internetarchive
  • pip install internetarchive

After installation, verify it works with ia --version.
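
When the check is done from a script rather than a shell, the same detection can be sketched in Python. The helper names are hypothetical; the code only looks the command up on PATH and, if found, runs it with --version.

```python
import shutil
import subprocess
from typing import Optional


def find_tool(tool: str = "ia") -> Optional[str]:
    """Return the full path of the command if it is on PATH, else None."""
    return shutil.which(tool)


def tool_version(tool: str = "ia") -> Optional[str]:
    """Return the output of '<tool> --version', or None if the tool is not installed."""
    path = find_tool(tool)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print the version to stderr, so fall back to it.
    return result.stdout.strip() or result.stderr.strip()
```

A None result is the cue to run one of the installation commands above.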

Global Options

These options work with all ia commands:

Option                  Description
-h, --help              Show help message
-v, --version           Display version
-c FILE, --config-file  Path to config file
-l, --log               Enable logging
-d, --debug             Enable debug output

Configuration and Authentication

Check if ia is configured:

ia configure --whoami

If not configured (shows error or empty), the user needs to set up credentials:

  1. Interactive setup: Run ia configure and follow prompts
  2. Get credentials: IA-S3 keys from https://archive.org/account/s3.php
  3. Config location: Saves to ~/.config/internetarchive/ia.ini (older releases may use ~/.config/ia.ini)

Configure Options

Option    Description
--whoami  Print current authenticated user
--show    Print current config as JSON
--check   Validate IA-S3 keys (exit 0 if valid, 1 otherwise)

# Show current config
ia configure --show

# Validate keys (useful in scripts)
ia configure --check && echo "Keys valid"

Environment Variables

Alternative to config file:

export IA_ACCESS_KEY_ID="your-access-key"
export IA_SECRET_ACCESS_KEY="your-secret-key"

Note: Configuration is required for uploads and metadata modifications. Searching and downloading public items works without authentication.

User-Agent Identification (Required)

All requests to the Internet Archive must include a proper User-Agent string that clearly identifies the source of the request. This applies to every request made via any tool - the ia CLI, Python library, direct API calls, curl, or any other HTTP client. This is critical for AI agents, bots, and automated tools.

The ia CLI automatically includes a default User-Agent with your access key:

internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0

When using Claude Code or other AI/LLM agents, you must append a custom suffix that includes:

  • The tool/agent name and version (e.g., "Claude Code/1.0.0")
  • The model being used if applicable (e.g., "claude-sonnet-4-20250514")
  • Any relevant context about the automation

The --user-agent-suffix CLI option and user_agent_suffix config setting require internetarchive version 5.7.2 or newer. The default User-Agent (including access key) is always sent - your suffix is appended to it.

CLI:

ia --user-agent-suffix "Claude Code/1.0.0 (claude-sonnet-4-20250514)" download my-item

INI file (~/.config/internetarchive/ia.ini):

[general]
user_agent_suffix = Claude Code/1.0.0 (claude-sonnet-4-20250514)

Python API:

from internetarchive import get_session

session = get_session(config={
    'general': {'user_agent_suffix': 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'}
})

The resulting User-Agent will look like:

internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0 Claude Code/1.0.0 (claude-sonnet-4-20250514)

This helps the Internet Archive track usage patterns, troubleshoot issues, and maintain service quality. Always be specific - include version numbers, model identifiers, and enough detail to distinguish your tool from others.
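
To keep the suffix consistent across CLI calls, the INI file, and the Python API, it can be assembled in one place. The helper below is a hypothetical sketch; joining the model and any extra context with "; " inside the parentheses is an assumption, not a format required by the Internet Archive.

```python
from typing import Optional


def build_user_agent_suffix(
    tool: str,
    version: str,
    model: Optional[str] = None,
    context: Optional[str] = None,
) -> str:
    """Build a suffix such as 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'."""
    suffix = f"{tool}/{version}"
    details = [part for part in (model, context) if part]
    if details:
        suffix += " (" + "; ".join(details) + ")"
    return suffix


# The resulting string fits the config shape shown in the Python API example above.
suffix = build_user_agent_suffix("Claude Code", "1.0.0", "claude-sonnet-4-20250514")
session_config = {"general": {"user_agent_suffix": suffix}}
```

The session_config dictionary is the same structure passed to get_session in the Python API example.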

Search Operations

Search the Internet Archive catalog:

ia search '<query>'

Search Parameters

Parameter        Description
--itemlist       Output identifiers only, one per line
-n, --num-found  Print only the count of results
-s, --sort       Sort results: --sort='field desc' or --sort='field asc'
-f, --field      Return specific metadata fields (repeatable)
-F, --fts        Full-text search (search within text content, not just metadata)
--parameters     Raw query parameters: --parameters="page=N&rows=N"

# Get result count only
ia search 'collection:nasa' -n

# Sort by date descending
ia search 'mediatype:texts' --sort='date desc'

# Return specific fields
ia search 'collection:nasa' --field=identifier --field=title

Sort Fields

Common sort fields for use with --sort:

Field                    Description
date                     Content date
publicdate               When item was published to archive.org
addeddate                When added to archive
updatedate               Last updated
title / titleSorter      Alphabetical by title
creator / creatorSorter  Alphabetical by creator
downloads                Total downloads
week                     Downloads this week
month                    Downloads this month
num_reviews              Number of reviews
num_favorites            Number of favorites
item_size                Total item size
files_count              Number of files

Use asc or desc suffix:

ia search 'mediatype:audio' --sort='downloads desc'
ia search 'collection:books' --sort='publicdate asc'
ia search 'creator:NASA' --sort='title asc'

Search Query Syntax

The Internet Archive uses Apache Lucene query syntax. By default, the operator is AND (all terms must be present).

Query Operators

Operator  Description
AND       All terms must be present (default)
OR        Any of the terms can be present
NOT       Exclude documents with term (requires at least one positive term)
( )       Group clauses to form subqueries

Field-Specific Searches

Use field:value syntax to search specific metadata fields:

Query                      Description
'title:"search text"'      By title
'creator:"Author Name"'    By creator/author
'subject:"topic"'          Search by subject
'description:"text"'       By description
'collection:name'          Items in a collection
'mediatype:texts'          By media type (texts, movies, audio, software, image, data)
'contributor:smithsonian'  By contributor
'language:eng'             By language code
'format:pdf'               Items containing specific file format
'isbn:9780123456789'       By ISBN
'licenseurl:http*by-nc*'   By Creative Commons license

Range Queries

Search values between bounds using square brackets (inclusive) or curly braces (exclusive):

Syntax          Description
[1000 TO 2000]  Inclusive range (includes bounds)
{1000 TO 2000}  Exclusive range (excludes bounds)
[1000 TO null]  Open-ended range (1000 or greater)
[null TO 2000]  Open-ended range (2000 or less)

Date Fields

Searchable date fields: addeddate, createdate, date, indexdate, publicdate, reviewdate, updatedate, oai_updatedate

Query                                                       Description
'date:[2020-01-01 TO 2024-12-31]'                           Date range
'publicdate:[2024-01-01 TO 2024-06-30]'                     By publication date
'indexdate:[2024-01-01T00:00:00Z TO 2024-12-31T23:59:59Z]'  With timestamp
'date:2024*'                                                Wildcard for year (non-range)
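
Range and date clauses follow one pattern, so they can be generated rather than typed. This is a small sketch with a hypothetical function name; it only formats the clause string, using null for an open bound as in the range table.

```python
from typing import Optional, Union

Bound = Optional[Union[int, str]]


def range_clause(field: str, low: Bound = None, high: Bound = None,
                 inclusive: bool = True) -> str:
    """Format a clause such as 'downloads:[1000 TO null]'."""
    if low is None and high is None:
        raise ValueError("at least one bound is required")
    lo = "null" if low is None else str(low)
    hi = "null" if high is None else str(high)
    # Square brackets include the bounds, curly braces exclude them.
    opener, closer = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opener}{lo} TO {hi}{closer}"
```

The returned string can be passed to ia search as a whole query or joined with other clauses.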

Fuzzy Queries

Append ~ for approximate spelling matches:

ia search 'title:buttonwood~'

# Boost fuzzy matches with weights
ia search '(title:buttonwood~)^150 OR (subject:buttonwood~)^100'

Searching for Missing Fields

Find items where a field doesn't exist:

ia search 'collection:microfiche AND NOT _exists_:creator'

Searching by Uploader

Search by uploader's user item, screen name, or email:

ia search '_uploader_useritem:@username'
ia search '_uploader_screenname:"Display Name"'
ia search 'uploader:your@email.com'

Additional Searchable Fields

Beyond standard metadata, you can search by:

  • downloads - download count
  • item_size - total item size in bytes
  • files_count - number of files
  • collection_size - size of collection
  • item_count - items in collection

ia search 'collection:opensource AND downloads:[1000 TO null]'
ia search 'mediatype:movies AND item_size:[1000000000 TO null]'

Combined Queries

# AND is implicit between terms
ia search 'collection:nasa mediatype:image'

# Explicit operators
ia search 'collection:nasa AND mediatype:image'
ia search 'mediatype:texts OR mediatype:audio'
ia search 'collection:opensource NOT mediatype:software'

# Grouped subqueries
ia search '(mediatype:texts OR mediatype:audio) AND creator:"Mark Twain"'
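
Combined queries like the ones above can also be assembled programmatically, which avoids quoting mistakes. The helpers below are hypothetical and only build strings; escaping embedded double quotes with a backslash follows common Lucene practice and is an assumption about how archive.org treats them.

```python
def term(field: str, value: str) -> str:
    """Format field:value, quoting values that contain whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{field}:{text}"


def all_of(*clauses: str) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


def any_of(*clauses: str) -> str:
    """Join clauses with OR and wrap them in parentheses as a subquery."""
    return "(" + " OR ".join(clauses) + ")"


query = all_of(
    any_of(term("mediatype", "texts"), term("mediatype", "audio")),
    term("creator", "Mark Twain"),
)
```

Here query reproduces the grouped subquery example above and can be passed to ia search as a single argument.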

Full-Text Search

Use the -F (or --fts) flag to search within the actual text content of items rather than just metadata. This is particularly powerful for searching text collections like books, documents, and OCR'd materials.

Basic full-text search:

ia search -F 'collection:collection_name "search phrase"'

How it works:

  • Searches inside the full text of documents (OCR'd PDFs, text files, etc.)
  • More powerful than metadata-only search for finding specific quotes or passages
  • Requires items to have searchable text (OCR or text files)
  • Can be combined with collection and metadata filters

Full-text search syntax:

  • Use quotes for exact phrases: "complete phrase"
  • Combine with metadata filters: collection:name AND "text to find"
  • Works best with text collections that have been OCR'd

Examples

# Search NASA images
ia search 'collection:nasa mediatype:image' --parameters="rows=10"

# Search public domain books
ia search 'subject:"public domain" mediatype:texts'

# Get just identifiers
ia search 'creator:"Mark Twain"' --itemlist

# Full-text search within a text collection
ia search -F 'collection:books "climate change"'

# Full-text search for a specific quote in public domain texts
ia search -F '"to be or not to be" mediatype:texts'

# Full-text search with collection filter and pagination
ia search -F 'collection:usgovernmentdocuments "artificial intelligence"' --parameters="rows=20"

Download Operations

Download files from an Internet Archive item:

ia download <identifier>

Download Parameters

Parameter                Description
--glob="*.ext"           Download only matching files (use | for multiple: '*.mp4|*.webm')
--exclude="*pattern*"    Exclude files matching pattern
--format="FORMAT"        Download specific derivative format
--source=SOURCE          Filter by source: original, derivative, metadata
--exclude-source=SOURCE  Exclude by source type
--destdir=path           Download to specific directory
--no-directories         Flatten directory structure
-s, --stdout             Write file to stdout (for piping)
--dry-run                Show what would be downloaded
--checksum               Skip files that already exist with correct checksum
--on-the-fly             Download on-the-fly files (generated derivatives)
--search="QUERY"         Download from search results
--itemlist=FILE          Download items listed in file

Filtering by Source Type

Use --source and --exclude-source to filter by file origin:

# Download only original files (skip all derivatives)
ia download my-item --source=original

# Download originals and metadata, skip derivatives
ia download my-item --exclude-source=derivative

# Download only metadata files
ia download my-item --source=metadata

Examples

# Download all files from an item
ia download TripDown1905

# Download specific files by name
ia download TripDown1905 file1.mp4 file2.ogv

# Download only MP4 files
ia download TripDown1905 --glob="*.mp4"

# Download MP4s but exclude low-quality versions
ia download TripDown1905 --glob="*.mp4" --exclude="*512kb*"

# Download specific format
ia download TripDown1905 --format='512Kb MPEG4'

# Download to specific directory
ia download TripDown1905 --destdir=./downloads

# Download from search results
ia download --search 'collection:opensource_movies' --glob="*.mp4"

# Download items from a list file
ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt
ia download --itemlist itemlist.txt

# Preview what will be downloaded
ia download my_item --dry-run
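
To reason about how --glob and --exclude combine before running anything, the selection can be approximated offline. This sketch uses Python's shell-style pattern matching and treats | as a pattern separator; it is an approximation under those assumptions, and ia download --dry-run remains the authoritative preview. The function name and file names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def preview_selection(filenames: List[str], glob: Optional[str] = None,
                      exclude: Optional[str] = None) -> List[str]:
    """Return the names that match the glob (if given) and do not match the exclude."""

    def matches_any(name: str, spec: str) -> bool:
        # "a|b" means: match pattern a or pattern b
        return any(fnmatchcase(name, pattern) for pattern in spec.split("|"))

    selected = []
    for name in filenames:
        if glob and not matches_any(name, glob):
            continue
        if exclude and matches_any(name, exclude):
            continue
        selected.append(name)
    return selected


# Hypothetical file names for illustration.
files = ["TripDown1905.mp4", "TripDown1905_512kb.mp4", "TripDown1905.ogv"]
```

With these names, a glob of "*.mp4" plus an exclude of "*512kb*" keeps only the full-quality MP4.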

Upload Operations

Upload files to the Internet Archive (requires authentication):

ia upload <identifier> file1 file2 --metadata="mediatype:value"

Required Metadata

The mediatype field is required. Common values:

  • texts - Books, documents, PDFs
  • movies - Video files
  • audio - Music, podcasts, sound
  • software - Programs, games
  • image - Photos, graphics
  • data - Datasets, archives

Upload Parameters

Parameter                Description
--metadata="key:value"   Set metadata (repeatable)
--header="key:value"     Set HTTP header
--checksum               Skip files already uploaded
-v, --verify             Verify data wasn't corrupted after upload
--no-derive              Skip derivative processing
--retries=N              Number of retry attempts
--remote-name=NAME       Set remote filename (for stdin uploads)
--keep-directories       Preserve directory structure in remote filename
-o, --open-after-upload  Open item in browser after upload
--file-metadata=FILE     File-level metadata from JSONL file
--spreadsheet=FILE       Bulk upload from CSV spreadsheet

Common Metadata Fields

--metadata="title:My Document Title"
--metadata="creator:Author Name"
--metadata="description:A description of the content"
--metadata="subject:topic1;topic2"
--metadata="collection:community_texts"
--metadata="date:2024-01-15"
--metadata="language:eng"
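
When uploads are driven from a script, the --metadata flags can be generated from a dictionary. The sketch below uses hypothetical helper names and only builds an argument list; list values are joined with ";" to mirror the subject example above. Nothing is executed.

```python
from typing import Dict, List, Union

MetadataValue = Union[str, List[str]]


def metadata_args(metadata: Dict[str, MetadataValue]) -> List[str]:
    """Turn a metadata dict into a list of --metadata=key:value arguments."""
    args = []
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            value = ";".join(value)  # same form as --metadata="subject:topic1;topic2"
        args.append(f"--metadata={key}:{value}")
    return args


def upload_command(identifier: str, files: List[str],
                   metadata: Dict[str, MetadataValue]) -> List[str]:
    """Build the argv list for an 'ia upload' call without running it."""
    return ["ia", "upload", identifier, *files, *metadata_args(metadata)]
```

Passing the list to subprocess.run (rather than a shell string) avoids quoting problems with titles that contain spaces.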

Examples

# Upload a PDF document
ia upload my-document-2024 document.pdf \
  --metadata="mediatype:texts" \
  --metadata="title:My Document" \
  --metadata="creator:John Doe"

# Upload multiple files
ia upload my-archive file1.pdf file2.pdf file3.pdf \
  --metadata="mediatype:texts" \
  --metadata="title:Document Collection"

# Upload with checksum verification and retries
ia upload my-item large-file.zip \
  --metadata="mediatype:data" \
  --checksum \
  --retries=10

# Upload from stdin
cat data.gz | ia upload my-item - \
  --remote-name=data.gz \
  --metadata="mediatype:data"

# Bulk upload using spreadsheet
ia upload --spreadsheet=metadata.csv

Notes:

  • Items default to the data mediatype if none is specified, so set mediatype explicitly
  • After upload, the mediatype can only be changed with admin support
  • Derivative generation takes seconds to days depending on file type and system load
  • Items typically appear in search within minutes, but can take up to 24 hours

Test Collection

Upload to test_collection for validation - items are automatically removed after ~30 days:

ia upload my-test-item file.pdf \
  --metadata="mediatype:texts" \
  --metadata="collection:test_collection"

Identifier Guidelines

  • Use lowercase letters, numbers, and hyphens
  • No spaces or special characters
  • Keep it descriptive but concise
  • Check if identifier exists: ia metadata <identifier> --exists

Item Thumbnail Image

To set a custom thumbnail for an item, upload an image named <identifier>_itemimage.jpg:

ia upload my-item my-item_itemimage.jpg

Restricting Downloads

To make files streamable but not downloadable, add the item to the stream_only collection:

ia metadata <identifier> --append-list="collection:stream_only"

Metadata Operations

View and modify item metadata:

# View metadata (JSON output)
ia metadata <identifier>

# Extract specific field with jq
ia metadata <identifier> | jq '.metadata.date'

# List file formats contained in an item
ia metadata <identifier> --formats

# Modify metadata (set or replace)
ia metadata <identifier> --modify="title:New Title"
ia metadata <identifier> --modify="foo:bar" --modify="baz:value"

# Remove a metadata field
ia metadata <identifier> --modify="fieldname:REMOVE_TAG"

# Append value to existing field
ia metadata <identifier> --append="title:Subtitle Here"

# Append to list field (e.g., subjects)
ia metadata <identifier> --append-list="subject:new topic"

# Remove specific value from list field
ia metadata <identifier> --remove="subject:old topic"

# Modify file-level metadata
ia metadata <identifier> --target="files/foo.txt" --modify="title:My File"

# Bulk updates from spreadsheet
ia metadata --spreadsheet=metadata.csv
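
As an alternative to jq, the JSON printed by ia metadata can be read in Python. One detail worth handling is that a field may hold either a single string or a list of values. The function name and the sample document below are hypothetical; the sketch assumes the top-level metadata key used in the jq example above.

```python
import json
from typing import List


def field_values(item_json: str, field: str) -> List[str]:
    """Return the values of a metadata field as a list, whether stored as a string or a list."""
    doc = json.loads(item_json)
    value = doc.get("metadata", {}).get(field)
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


# Hypothetical output of `ia metadata <identifier>`, trimmed for illustration.
sample = json.dumps({
    "metadata": {
        "identifier": "my-document-2024",
        "title": "My Document",
        "subject": ["topic1", "topic2"],
    }
})
```

Normalizing to a list keeps downstream code the same for single-valued and repeated fields.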

List Operations

List files in an Internet Archive item:

ia list <identifier>

Shows all files with details (name, size, format).

List Parameters

Parameter            Description
--columns=name,size  Specify columns to show
--glob="*.pdf"       Filter by pattern
-l, --location       Print full URLs for each file
-a, --all            List all available file information
-v, --verbose        Print column headers

# List with full URLs
ia list my-item --location

# List all file info with headers
ia list my-item --all --verbose

# List specific columns
ia list my-item --columns=name,size,format

Tasks and Jobs

Check status of catalog tasks (uploads, derives, etc.):

# Check tasks for a specific item
ia tasks <identifier>

# Check all your tasks
ia tasks

Darking and Undarking Items

To make an item dark (hidden from public access) or undark it:

# Dark an item (requires comment)
ia tasks <identifier> --cmd=make_dark.php --comment="Reason for darking"

# Undark an item
ia tasks <identifier> --cmd=make_undark.php --comment="Reason for undarking"

Bulk Operations with GNU Parallel

For batch processing many items, use GNU Parallel to run ia commands concurrently.

Installation

# macOS
brew install parallel

# Debian/Ubuntu
apt install parallel

Basic Usage

Pipe item identifiers to parallel, using {} as placeholder:

# Fetch metadata for many items
cat itemlist.txt | parallel 'ia metadata {}'

# Download multiple items
cat itemlist.txt | parallel 'ia download {}'

Careful Batch Processing

For reliable bulk operations, use job logging to track progress and handle failures:

# Step 1: Create item list
ia search 'collection:myproject' --itemlist > itemlist.txt

# Step 2: Run with job logging
cat itemlist.txt | parallel --joblog job.log 'ia download {}'

# Step 3: Check for failures
echo $?  # 0 = all succeeded

# Step 4: Retry only failed jobs
parallel --retry-failed --joblog job.log

Job Log Benefits

The --joblog file tracks each command's exit status, allowing you to:

  • Resume interrupted batch jobs
  • Retry only failed items without re-processing successes
  • Audit what succeeded and failed
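
For the audit step, the job log can be read directly: GNU Parallel writes it as tab-separated text with a header row that includes Exitval, Signal, and Command columns. The following is a minimal sketch with a hypothetical function name and made-up sample log lines.

```python
from typing import List


def failed_commands(joblog_text: str) -> List[str]:
    """Return the commands whose exit value or signal is non-zero."""
    lines = [line for line in joblog_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    failed = []
    for line in lines[1:]:
        # Limit the split so a command that itself contains tabs stays in one piece.
        fields = line.split("\t", len(header) - 1)
        row = dict(zip(header, fields))
        if row.get("Exitval") != "0" or row.get("Signal") != "0":
            failed.append(row.get("Command", ""))
    return failed


# Made-up job log content for illustration.
sample_log = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1700000000.000\t2.500\t0\t0\t0\t0\tia download item-a\n"
    "2\t:\t1700000003.000\t1.200\t0\t0\t1\t0\tia download item-b\n"
)
```

This only reports failures; rerunning them is still best left to parallel --retry-failed as shown above.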

Dry Run First

Always preview before bulk execution:

cat itemlist.txt | parallel --dry-run 'ia download {}'

Rate Limiting

Control concurrency to avoid overwhelming the server:

# Limit to 4 concurrent jobs
cat itemlist.txt | parallel -j4 'ia download {}'

# Add delay between jobs
cat itemlist.txt | parallel --delay 1 'ia download {}'

See: https://archive.org/developers/internetarchive/parallel.html

Best Practices

  1. Always configure before uploading - Run ia configure first
  2. Use meaningful identifiers - Descriptive, lowercase, hyphenated
  3. Include proper metadata - At minimum: mediatype, title, creator
  4. Check before uploading - Verify the identifier doesn't already exist: ia metadata <id> --exists
  5. Use checksums - Add --checksum for large uploads to enable resume
  6. Respect rate limits - Don't spam requests; add delays for bulk operations
  7. Test with dry-run - Use --dry-run to preview operations
  8. Use test_collection first - Validate uploads before committing to permanent collections
  9. Zip large file sets - Bundle many small files into archives before uploading
  10. Specify language - Set language metadata for proper OCR processing on texts

Error Handling

Error                         Solution
"not configured"              Run ia configure or set environment variables
"identifier exists"           Choose a different identifier
"permission denied"           Check credentials at https://archive.org/account/s3.php
"network error"               Retry the operation; check internet connection
"item not found"              Verify the identifier spelling
"429 Too Many Requests"       Rate limited; wait and retry with Retry-After header value
Item not appearing in search  Usually appears within minutes; check ia tasks <identifier> for pending jobs
Derive task failed            Check filename characters, file format, language metadata
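
For the 429 case, the Retry-After header can hold either a number of seconds or an HTTP date. The sketch below converts either form into a wait time in seconds. The function name is hypothetical, and the 60-second fallback for a missing or unparseable header is an arbitrary assumption, not an Internet Archive recommendation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(retry_after: Optional[str], now: Optional[datetime] = None,
                default: float = 60.0) -> float:
    """Return how many seconds to wait before retrying."""
    if retry_after is None:
        return default
    text = retry_after.strip()
    if text.isdigit():
        return float(text)  # delay given directly in seconds
    try:
        when = parsedate_to_datetime(text)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

The now parameter exists so the date branch can be exercised with a fixed clock; in normal use it is left unset.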

Quick Reference

# Search
ia search 'query'
ia search 'query' --itemlist

# Download
ia download <identifier>
ia download <identifier> --glob="*.pdf"

# Upload (requires auth)
ia upload <identifier> files --metadata="mediatype:texts"

# Metadata
ia metadata <identifier>
ia metadata <identifier> --modify="title:New Title"

# List files
ia list <identifier>

# Tasks
ia tasks <identifier>

# Config
ia configure
ia configure --whoami

# Install
uv tool install internetarchive

API Reference

For programmatic access beyond the CLI, see the full developer documentation: https://archive.org/developers

Core APIs

API              Description
Items            Understanding item structure and access
Metadata Schema  Complete metadata field reference
Metadata Read    Retrieve item metadata via API
Metadata Write   Modify item metadata via API
IAS3             S3-compatible API for uploads
Tasks            Task queue management

Additional APIs

API           Description
Changes       Track item modifications across the archive
Views         Access viewing and download statistics
Reviews       Manage item reviews
Simple Lists  Create item relationships and lists
OCR Service   Text recognition service
PDF Service   PDF generation and processing

Python Library

For Python integration: internetarchive library

TypeScript Library (Third-Party)

A community-maintained TypeScript port is available: internetarchive-ts (docs)

Note: This is a work in progress and not officially maintained by the Internet Archive.



Metadata

License: unknown
Version: -
Updated: 2/15/2026
Publisher: Ramblurr

Tags

api, github, llm, observability, security, testing