mistral-performance-tuning

Optimize Mistral AI performance with caching, batching, and latency reduction. Use when experiencing slow API responses, implementing caching strategies, or optimizing request throughput for Mistral AI integrations. Trigger with phrases like "mistral performance", "optimize mistral", "mistral latency", "mistral caching", "mistral slow", "mistral batch".

Updated 2/8/2026

Mistral AI Performance Tuning

Overview

Reduce latency and raise throughput for Mistral AI API calls using response caching, request batching, connection reuse, streaming, model selection, and performance monitoring.

Prerequisites

  • Mistral AI SDK installed
  • Understanding of async patterns
  • Redis or in-memory cache available (optional)
  • Performance monitoring in place

Latency Benchmarks

| Model | P50 | P95 | P99 | Use Case |
|---|---|---|---|---|
| mistral-small-latest | 200ms | 500ms | 1s | Fast responses |
| mistral-large-latest | 500ms | 1.5s | 3s | Complex reasoning |
| mistral-embed | 50ms | 150ms | 300ms | Embeddings |
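
The figures above are best read as rough guides; latencies measured against your own workload and region will differ. A minimal sketch for deriving the same percentiles from recorded request durations, using the nearest-rank method. The names `percentile` and `summarizeLatencies` are illustrative and not part of any SDK.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error('percentile() needs at least one sample');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in the range (0, 100]');
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // 1-based rank; multiply before dividing to keep integer inputs exact
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Produce the three columns used in the benchmark table.
function summarizeLatencies(samplesMs: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
  };
}
```

Feeding this the `latencyMs` values collected in Step 7 gives a table you can compare against the one above.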

Instructions

Step 1: Response Caching

import { LRUCache } from 'lru-cache';
import crypto from 'crypto';
import { Mistral } from '@mistralai/mistralai';

// In-memory LRU cache: at most 1000 responses, each kept for 5 minutes
const cache = new LRUCache<string, string>({
  max: 1000,
  ttl: 5 * 60 * 1000, // 5 minutes
  updateAgeOnGet: true, // every hit resets the entry's TTL, so hot entries stay cached
});

function getCacheKey(messages: any[], model: string, options?: any): string {
  const data = JSON.stringify({ messages, model, options });
  return crypto.createHash('sha256').update(data).digest('hex');
}

async function cachedChat(
  client: Mistral,
  messages: any[],
  model: string,
  options?: { temperature?: number; maxTokens?: number }
): Promise<string> {
  // Only cache requests that explicitly set temperature to 0; anything else
  // is treated as non-deterministic and always goes to the API.
  const isCacheable = options?.temperature === 0;
  const key = isCacheable ? getCacheKey(messages, model, options) : null;

  if (key !== null) {
    const cached = cache.get(key);
    // Compare against undefined so a cached empty string still counts as a hit
    if (cached !== undefined) {
      console.log('Cache hit');
      return cached;
    }
  }

  const response = await client.chat.complete({
    model,
    messages,
    ...options,
  });

  // content may be a string or an array of chunks; only plain strings are handled here
  const raw = response.choices?.[0]?.message?.content;
  const content = typeof raw === 'string' ? raw : '';

  if (key !== null) {
    cache.set(key, content);
  }

  return content;
}
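
One caveat with `getCacheKey`: `JSON.stringify` serializes object keys in insertion order, so `{ temperature: 0, maxTokens: 100 }` and `{ maxTokens: 100, temperature: 0 }` hash to different keys and miss each other's cache entries. A minimal sketch of a canonical serializer that could stand in for `JSON.stringify` inside `getCacheKey`. The name `stableStringify` is hypothetical, and the sketch assumes JSON-compatible input only.

```typescript
// Serialize a JSON-compatible value with object keys sorted recursively,
// so logically equal inputs always produce the same string.
// Array order is preserved, because message order is meaningful.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .filter((key) => record[key] !== undefined) // mirror JSON.stringify, which drops undefined properties
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`);
    return `{${entries.join(',')}}`;
  }
  // Primitives follow JSON rules; undefined (e.g. inside an array) becomes null
  return JSON.stringify(value) ?? 'null';
}
```

With this in place, reordering the fields of `options` no longer changes the resulting SHA-256 key.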

Step 2: Redis Distributed Caching

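import Redis from 'ioredis';
import { Mistral } from '@mistralai/mistralai';

// Falls back to a local Redis instance when REDIS_URL is unset (assumed default)
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Read-through cache: return the stored value if present, otherwise call
// fetcher(), store its JSON-serialized result for ttlSeconds, and return it.
async function cachedWithRedis<T>(
  key: string,
  fetcher: () => Promise<T>,
  ttlSeconds = 300
): Promise<T> {
  const cached = await redis.get(key);
  // redis.get resolves to null on a miss
  if (cached !== null) {
    return JSON.parse(cached) as T;
  }

  const result = await fetcher();
  await redis.setex(key, ttlSeconds, JSON.stringify(result));
  return result;
}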

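// Semantic cache for similar queries.
// Expects entries stored under `semantic:*` keys as JSON of the form
// { embedding: number[], response: string }; writing those entries is left to the caller.
// cosineSimilarity(a, b) must be defined separately.
async function semanticCache(
  client: Mistral,
  query: string,
  threshold = 0.95
): Promise<string | null> {
  // Get embedding for query
  const queryEmbed = await client.embeddings.create({
    model: 'mistral-embed',
    inputs: [query],
  });
  const queryVector = queryEmbed.data[0]?.embedding;
  if (!queryVector) {
    return null;
  }

  // KEYS walks the whole keyspace and blocks Redis while it runs. That is
  // acceptable for a small cache; prefer SCAN or a vector index for larger ones.
  const cachedKeys = await redis.keys('semantic:*');

  for (const key of cachedKeys) {
    const raw = await redis.get(key);
    if (raw === null) continue; // entry expired between KEYS and GET
    const cached = JSON.parse(raw);
    if (!Array.isArray(cached.embedding)) continue; // skip malformed entries
    const similarity = cosineSimilarity(queryVector, cached.embedding);

    if (similarity >= threshold) {
      console.log(`Semantic cache hit (similarity: ${similarity.toFixed(3)})`);
      return cached.response;
    }
  }

  return null;
}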
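
`semanticCache` calls `cosineSimilarity`, which the snippet above does not define. A minimal version is sketched below, assuming both embeddings are plain `number[]` arrays of equal length. Returning 0 for a zero-length vector is a choice made for this sketch, so that degenerate input never registers as a cache hit.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Vector length mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] as number;
    const y = b[i] as number;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat it as dissimilar to everything
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the loop in `semanticCache` compares the query against every stored entry, lookup cost grows linearly with cache size; the threshold of 0.95 is the source's default and worth tuning against real queries.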

Step 3: Request Batching

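import DataLoader from 'dataloader';
import { Mistral } from '@mistralai/mistralai';

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

// Batch embedding requests: load() calls made within the same window are
// merged into a single embeddings request.
const embeddingLoader = new DataLoader<string, number[]>(
  async (texts) => {
    const response = await client.embeddings.create({
      model: 'mistral-embed',
      inputs: [...texts],
    });
    // DataLoader requires one result per key, in the same order as the keys
    return response.data.map(d => d.embedding ?? new Error('Missing embedding'));
  },
  {
    maxBatchSize: 100, // Per-request cap; verify against current Mistral limits
    batchScheduleFn: callback => setTimeout(callback, 10), // 10ms window
  }
);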

// Usage - automatically batched
const [embed1, embed2, embed3] = await Promise.all([
  embeddingLoader.load('Text 1'),
  embeddingLoader.load('Text 2'),
  embeddingLoader.load('Text 3'),
]);
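
DataLoader suits call sites that request embeddings one at a time. When the full list of texts is already in hand, for example during bulk indexing, the same batch-size cap can be applied by splitting the list explicitly. A minimal sketch: `chunk`, `embedAll`, and the `EmbedBatch` callback type are hypothetical names, and the callback stands in for whatever function wraps `client.embeddings.create`.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical callback: embeds one batch and returns one vector per input text.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Embed every text, one batch at a time, preserving input order.
async function embedAll(
  texts: readonly string[],
  embedBatch: EmbedBatch,
  batchSize = 100 // same assumed cap as maxBatchSize above
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, batchSize)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```

Batches run sequentially here to keep the sketch simple; combining it with a concurrency limit would trade rate-limit headroom for throughput.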

Step 4: Connection Optimization

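import { Agent } from 'https';
import { Mistral } from '@mistralai/mistralai';

// Keep-alive connection pooling
const agent = new Agent({
  keepAlive: true,    // reuse sockets across requests instead of reconnecting
  maxSockets: 10,     // max concurrent sockets per host
  maxFreeSockets: 5,  // idle sockets kept open for reuse
  timeout: 60000,     // socket timeout in ms
});

// Note: this agent is not wired into the Mistral client here. Check whether
// your SDK version accepts a custom agent or HTTP client. If it does not,
// connection reuse is left to the runtime's HTTP layer.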

Step 5: Streaming for Perceived Performance

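// Streaming reduces Time to First Token (TTFT)
interface StreamMetrics {
  ttft: number;            // ms from request start to the first content chunk
  tokensPerSecond: number; // approximate: each streamed chunk is counted as one token
}

async function* streamWithMetrics(
  client: Mistral,
  messages: any[],
  model: string
): AsyncGenerator<{ content: string; metrics: StreamMetrics }> {
  const startTime = Date.now();
  let firstTokenTime: number | null = null;
  let tokenCount = 0;

  const stream = await client.chat.stream({ model, messages });

  for await (const event of stream) {
    const content = event.data?.choices?.[0]?.delta?.content;
    // Skip empty deltas and non-text chunks
    if (typeof content === 'string' && content.length > 0) {
      if (firstTokenTime === null) {
        firstTokenTime = Date.now();
      }
      tokenCount++;
      // Clamp to 1ms so the first chunk never divides by zero
      const elapsedSeconds = Math.max(Date.now() - startTime, 1) / 1000;
      yield {
        content,
        metrics: {
          ttft: firstTokenTime - startTime,
          tokensPerSecond: tokenCount / elapsedSeconds,
        },
      };
    }
  }
}

// Usage
let fullResponse = '';
let lastMetrics: StreamMetrics | null = null;
for await (const { content, metrics } of streamWithMetrics(client, messages, 'mistral-small-latest')) {
  fullResponse += content;
  lastMetrics = metrics; // `metrics` is scoped to the loop body, so keep the latest copy
  process.stdout.write(content);
}
if (lastMetrics) {
  console.log(`\nTTFT: ${lastMetrics.ttft}ms, Speed: ${lastMetrics.tokensPerSecond.toFixed(1)} tok/s`);
}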
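
The TTFT and throughput arithmetic can be checked without calling the API by feeding a fake stream and an injectable clock into a small collector. The sketch below assumes the stream has already been reduced to text chunks; `collectWithMetrics` and `StreamSummary` are hypothetical names, and throughput is reported in chunks per second because chunk boundaries do not necessarily match token boundaries.

```typescript
interface StreamSummary {
  text: string;
  ttftMs: number | null; // null when the stream produced no content
  chunksPerSecond: number;
}

// Consume a stream of text chunks and report time-to-first-chunk and throughput.
// `now` is injectable so tests can supply a deterministic clock.
async function collectWithMetrics(
  chunks: AsyncIterable<string>,
  now: () => number = Date.now
): Promise<StreamSummary> {
  const start = now();
  let firstChunkAt: number | null = null;
  let chunkCount = 0;
  let text = '';

  for await (const piece of chunks) {
    if (piece.length === 0) continue; // ignore empty deltas
    if (firstChunkAt === null) {
      firstChunkAt = now();
    }
    chunkCount++;
    text += piece;
  }

  const elapsedSeconds = (now() - start) / 1000;
  return {
    text,
    ttftMs: firstChunkAt === null ? null : firstChunkAt - start,
    chunksPerSecond: elapsedSeconds > 0 ? chunkCount / elapsedSeconds : 0,
  };
}
```

Keeping the clock injectable also makes it straightforward to swap `Date.now` for a monotonic timer if wall-clock adjustments are a concern.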

Step 6: Model Selection for Speed

type SpeedTier = 'fastest' | 'balanced' | 'quality';

function selectModelForSpeed(tier: SpeedTier, taskComplexity: 'low' | 'medium' | 'high'): string {
  const matrix = {
    fastest: {
      low: 'mistral-small-latest',
      medium: 'mistral-small-latest',
      high: 'mistral-small-latest',
    },
    balanced: {
      low: 'mistral-small-latest',
      medium: 'mistral-small-latest',
      high: 'mistral-large-latest',
    },
    quality: {
      low: 'mistral-small-latest',
      medium: 'mistral-large-latest',
      high: 'mistral-large-latest',
    },
  };

  return matrix[tier][taskComplexity];
}

Step 7: Performance Monitoring

interface PerformanceMetrics {
  model: string;
  latencyMs: number;
  ttftMs?: number;
  tokensPerSecond?: number;
  inputTokens: number;
  outputTokens: number;
  cached: boolean;
}

async function measurePerformance(
  operation: () => Promise<any>,
  metadata: Partial<PerformanceMetrics>
): Promise<{ result: any; metrics: PerformanceMetrics }> {
  const start = Date.now();

  const result = await operation();

  const metrics: PerformanceMetrics = {
    // Caller-supplied metadata goes first so the measured values below always win
    ...metadata,
    model: metadata.model ?? 'unknown',
    latencyMs: Date.now() - start,
    inputTokens: result.usage?.promptTokens ?? 0,
    outputTokens: result.usage?.completionTokens ?? 0,
    cached: metadata.cached ?? false,
  };

  // Log to monitoring system
  console.log('[PERF]', JSON.stringify(metrics));

  return { result, metrics };
}

// Usage
const { result, metrics } = await measurePerformance(
  () => client.chat.complete({ model, messages }),
  { model, cached: false }
);
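
Individual `[PERF]` log lines become more useful once aggregated. A minimal sketch of a summary over collected samples follows; `RequestMetrics` redeclares only the `PerformanceMetrics` fields it needs so the block stands alone, and `summarizeMetrics` is a hypothetical helper.

```typescript
// Subset of PerformanceMetrics used for aggregation
interface RequestMetrics {
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  cached: boolean;
}

interface MetricsSummary {
  requests: number;
  avgLatencyMs: number;
  cacheHitRate: number; // fraction between 0 and 1
  totalTokens: number;
}

// Aggregate per-request metrics into averages and totals.
function summarizeMetrics(samples: readonly RequestMetrics[]): MetricsSummary {
  if (samples.length === 0) {
    return { requests: 0, avgLatencyMs: 0, cacheHitRate: 0, totalTokens: 0 };
  }
  let latencySum = 0;
  let cacheHits = 0;
  let totalTokens = 0;
  for (const sample of samples) {
    latencySum += sample.latencyMs;
    if (sample.cached) cacheHits++;
    totalTokens += sample.inputTokens + sample.outputTokens;
  }
  return {
    requests: samples.length,
    avgLatencyMs: latencySum / samples.length,
    cacheHitRate: cacheHits / samples.length,
    totalTokens,
  };
}
```

A falling cache hit rate or rising average latency in this summary is usually the first sign that TTLs or model selection need revisiting.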

Output

  • Reduced API latency
  • Caching layer implemented
  • Request batching enabled
  • Performance monitoring active

Error Handling

| Issue | Cause | Solution |
|---|---|---|
| Cache miss storm | TTL expired | Use stale-while-revalidate |
| Batch timeout | Too many items | Reduce batch size |
| Memory pressure | Cache too large | Set max cache entries |
| Slow TTFT | Large prompts | Reduce prompt size or use smaller model |
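
The stale-while-revalidate fix in the first row can be sketched as follows: once an entry passes its freshness window, callers keep receiving the old value immediately while a single background request refreshes it, so an expired key does not send every caller to the API at once. This is an in-memory illustration under stated assumptions; `SwrCache` is a hypothetical class, not part of lru-cache or ioredis.

```typescript
interface SwrEntry<T> {
  value: T;
  storedAt: number;
}

class SwrCache<T> {
  private readonly entries = new Map<string, SwrEntry<T>>();
  private readonly inflight = new Map<string, Promise<T>>();
  private readonly freshMs: number;
  private readonly now: () => number;

  constructor(freshMs: number, now: () => number = Date.now) {
    this.freshMs = freshMs;
    this.now = now;
  }

  async get(key: string, fetcher: () => Promise<T>): Promise<T> {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      // Nothing cached yet: callers must wait, but concurrent callers share one fetch
      return this.refresh(key, fetcher);
    }
    if (this.now() - entry.storedAt > this.freshMs) {
      // Stale: serve the old value now and refresh in the background
      this.refresh(key, fetcher).catch(() => {
        // Refresh failed; keep serving the stale value
      });
    }
    return entry.value;
  }

  // Start a fetch for `key`, or join the one already in flight.
  private refresh(key: string, fetcher: () => Promise<T>): Promise<T> {
    const pending = this.inflight.get(key);
    if (pending !== undefined) return pending;

    const request = Promise.resolve()
      .then(fetcher)
      .then(
        (value) => {
          this.inflight.delete(key);
          this.entries.set(key, { value, storedAt: this.now() });
          return value;
        },
        (error: unknown) => {
          this.inflight.delete(key);
          throw error;
        }
      );
    this.inflight.set(key, request);
    return request;
  }
}
```

The sketch never evicts entries, so in practice it would need a size cap like the `max` option used in Step 1.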

Examples

Quick Performance Wrapper

const withPerformance = async <T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> => {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Log the duration even when the call throws
    console.log(`[${name}] ${Date.now() - start}ms`);
  }
};

// Usage
const response = await withPerformance('chat', () =>
  client.chat.complete({ model, messages })
);

Parallel Requests with Concurrency Limit

import pLimit from 'p-limit';

const limit = pLimit(5); // Max 5 concurrent requests

const results = await Promise.all(
  prompts.map(prompt =>
    limit(() => client.chat.complete({
      model: 'mistral-small-latest',
      messages: [{ role: 'user', content: prompt }],
    }))
  )
);
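
Note that `Promise.all` rejects as soon as any single request fails, which discards the results of the others. A dependency-free sketch of the same idea that records per-item success or failure instead is shown below; `mapWithConcurrency` and the `Settled` type are hypothetical names, and the sketch illustrates the mechanism rather than replacing p-limit.

```typescript
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the order of `items`; failures are captured per item.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<Settled<R>[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array<Settled<R>>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index] as T, index) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

Used with the example above, `prompts` would be the items and the `client.chat.complete` call the worker, leaving failed prompts available for a retry pass.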

Next Steps

For cost optimization, see mistral-cost-tuning.

Metadata

License: unknown
Version: 1.0.0
Updated: 2/8/2026
Publisher: Dicklesworthstone

Tags

api, github, graphql, observability, prompting