accuracy-evaluator

This skill should be used when evaluating translation accuracy. It assesses semantic fidelity using backtranslation comparison, glossary term application, and format preservation. It provides a 0-5 score with detailed reasoning.

1 star
1.2k downloads
Updated 1/6/2026

Package Files

SKILL.md

Accuracy Evaluator Skill

This skill evaluates translation accuracy by analyzing semantic fidelity, terminology consistency, and format integrity.

Role

You evaluate translations objectively, using the backtranslation as a verification tool.

Behavior

<investigate_before_answering> Compare the backtranslation with the original text before judging semantic accuracy. Do not assume meaning is preserved - verify it through comparison. Check each glossary term individually. </investigate_before_answering>

<conservative_scoring> When uncertain between two scores, choose the lower score. It is better to flag potential issues than to miss them. </conservative_scoring>

Evaluation Procedure

Step 1: Semantic Accuracy (Backtranslation Comparison)

Compare the original text with the backtranslation:

  • Identify any meaning that was lost in translation
  • Identify any meaning that was added (not in original)
  • Identify any meaning that was distorted or reversed
  • Note subtle nuance changes

Rate semantic fidelity:

  • Complete: All meaning preserved exactly
  • Minor loss: Small nuances lost but core meaning intact
  • Partial: Some significant meaning lost or added
  • Major: Core meaning distorted
  • Failed: Meaning reversed or completely wrong

Step 2: Terminology Verification (Glossary Compliance)

For each glossary term in the source:

  • Check if the correct translation was used
  • Verify brand names are exact matches
  • Confirm product names follow the glossary
  • Note any deviations or alternatives used

Rate terminology compliance:

  • Perfect: All glossary terms correctly applied
  • Minor: 1 term with acceptable alternative
  • Partial: Multiple terms incorrect or missing
  • Failed: Brand names or critical terms wrong
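The per-term check above can be partly mechanized. The following is a minimal sketch, assuming a glossary given as a plain source-to-target dictionary and a naive case-insensitive substring match; the names `TermCheck` and `check_glossary` are hypothetical and not part of the skill. It only reports which expected terms are absent, so deciding whether a deviation is an acceptable alternative or a brand-name error still falls to the evaluator.

```python
from dataclasses import dataclass


@dataclass
class TermCheck:
    source_term: str   # glossary key as it appears in the source text
    expected: str      # translation the glossary prescribes
    found: bool        # whether the expected translation appears in the output


def check_glossary(source: str, translation: str, glossary: dict[str, str]) -> list[TermCheck]:
    """Check each glossary term that occurs in the source, one at a time."""
    results = []
    for source_term, expected in glossary.items():
        if source_term not in source:
            continue  # term not used in this segment, nothing to verify
        # Assumption: a case-insensitive substring match is good enough here.
        found = expected.lower() in translation.lower()
        results.append(TermCheck(source_term, expected, found))
    return results


if __name__ == "__main__":
    checks = check_glossary(
        "데이터를 백업하고 복원할 수 있습니다.",
        "You can backup and restore your data.",
        {"백업": "back up", "복원": "restore"},
    )
    for check in checks:
        print(check)
```

Substring matching will miss inflected forms and can over-match short terms, so treat a `found=False` result as a prompt to look closer rather than as a verdict.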

Step 3: Format Integrity Check

Verify preservation of:

  • HTML tags (<a>, </a>, <b>, <br>, etc.)
  • Placeholders ({0}, {1}, %s, %d, etc.)
  • Special characters and escapes
  • Line breaks and paragraph structure
  • Numbers, dates, units

Rate format integrity:

  • Perfect: All format elements preserved
  • Minor: Whitespace or minor formatting differences
  • Partial: 1 tag or placeholder affected
  • Failed: Multiple format elements broken
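As an illustration of the tag and placeholder part of this check, here is a minimal sketch that extracts format elements from both texts and reports the difference. The regular expressions and the function names `format_elements` and `format_diff` are assumptions made for this example; they cover only simple HTML tags and the `{0}`, `%s`, `%d` placeholder styles listed above.

```python
import re
from collections import Counter

# Assumed patterns: opening/closing HTML tags, and {N}, %s, %d placeholders.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
PLACEHOLDER_RE = re.compile(r"\{\d+\}|%[sd]")


def format_elements(text: str) -> Counter:
    """Count every tag and placeholder occurrence in the text."""
    return Counter(TAG_RE.findall(text) + PLACEHOLDER_RE.findall(text))


def format_diff(source: str, translation: str) -> dict[str, list[str]]:
    """Return elements missing from, or added to, the translation."""
    src = format_elements(source)
    tgt = format_elements(translation)
    return {
        "missing": sorted((src - tgt).elements()),
        "added": sorted((tgt - src).elements()),
    }


if __name__ == "__main__":
    print(format_diff(
        "<b>{0}</b> 파일을 %s에 저장했습니다.",
        "Saved <b>{0}</b> file to the folder.",
    ))
```

Tags are compared as exact strings, so a tag whose attributes were altered shows up as one missing and one added element. Whitespace, line breaks, numbers, dates, and units are not covered by this sketch.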

Step 4: Calculate Final Score

Combine the three assessments:

  • Semantic accuracy: 50% weight
  • Terminology compliance: 30% weight
  • Format integrity: 20% weight

Apply the scoring rubric to determine final score (0-5).
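The weighting can be sketched numerically. The skill does not say how the categorical ratings convert to numbers, so the point tables below are purely an assumption for illustration, as is the function name `weighted_baseline`. Integer weights and floor division keep the arithmetic exact and round down, in line with the conservative-scoring rule.

```python
# Assumed point values for each rating level (not defined by the skill).
SEMANTIC_POINTS = {"Complete": 5, "Minor loss": 4, "Partial": 3, "Major": 1, "Failed": 0}
LEVEL_POINTS = {"Perfect": 5, "Minor": 4, "Partial": 2, "Failed": 0}

# Weights from Step 4, expressed as integer percentages.
WEIGHTS = {"semantic": 50, "terminology": 30, "format": 20}


def weighted_baseline(semantic: str, terminology: str, fmt: str) -> int:
    """Combine the three ratings into a 0-5 baseline, rounding down."""
    total = (
        WEIGHTS["semantic"] * SEMANTIC_POINTS[semantic]
        + WEIGHTS["terminology"] * LEVEL_POINTS[terminology]
        + WEIGHTS["format"] * LEVEL_POINTS[fmt]
    )
    return total // 100  # floor division: when in doubt, the lower score


if __name__ == "__main__":
    # Complete meaning, one terminology style deviation, intact format.
    print(weighted_baseline("Complete", "Minor", "Perfect"))
```

The result is only a starting point. Under these assumed values a reversed meaning with clean terminology and format still yields a baseline of 2, whereas the rubric treats a meaning reversal as a severe error, so the rubric should decide the final score.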

Scoring Rubric

Score 4 (Minor Issues) - Pass with Notes

  • Core meaning preserved, minor nuance differences
  • Glossary terms correct, possibly 1 acceptable alternative
  • Format elements intact
  • Corrections are optional improvements

Score 3 (Borderline) - Requires Review

  • Some meaning lost or subtle additions
  • 1-2 glossary terms incorrect or missing
  • Minor format issues
  • Requires human review or regeneration

Score 2 (Significant Issues) - Fail

  • Noticeable meaning distortion
  • Multiple glossary violations
  • Format elements broken
  • Must be regenerated

Score 1 (Severe Errors) - Fail

  • Major meaning reversal or loss
  • Brand names or critical terms wrong
  • Multiple format failures
  • Potentially harmful if published

Score 0 (Unusable) - Fail

  • Translation unrelated to source
  • Complete format destruction
  • Cannot be salvaged

Few-Shot Examples

Example 1: Score 5

Source: ABC 클라우드는 사용자의 ABC 계정과 연동된 정보를 동기화합니다.
Translation: ABC Cloud syncs information linked to your ABC account.
Backtranslation: ABC 클라우드는 ABC 계정에 연결된 정보를 동기화합니다.
Glossary: {{"ABC 클라우드": "ABC Cloud", "ABC 계정": "ABC account", "동기화": "sync"}}

Evaluation:

  • Step 1 (Semantic): The backtranslation matches the source semantically. Core meaning is 100% preserved.
  • Step 2 (Terminology): "ABC 클라우드"→"ABC Cloud", "ABC 계정"→"ABC account", and "동기화"→"sync" are all correct.
  • Step 3 (Format): No special formatting. Not applicable.
  • Score: 5 - Meaning, terminology, and format are all perfect.

Example 2: Score 4 (Minor Issues)

Source: 데이터를 백업하고 복원할 수 있습니다.
Translation: You can backup and restore your data.
Backtranslation: 데이터를 백업하고 복원할 수 있습니다.
Glossary: {{"백업": "back up", "복원": "restore"}}

Evaluation:

  • Step 1 (Semantic): Meaning matches completely.
  • Step 2 (Terminology): "backup" was used. The glossary recommends "back up" (verb, two words). The meaning is the same, but the style differs.
  • Step 3 (Format): Format intact.
  • Score: 4 - Minor terminology style difference. Correction recommended.

Correction:

{{"original": "backup", "suggested": "back up", "reason": "Standard verb form per the glossary"}}

Example 3: Score 3 (Borderline)

Source: 24시간 내에 반드시 설치하세요.
Translation: You must install within 24 hours guaranteed.
Backtranslation: 24시간 내에 반드시 설치하세요, 보장됨.

Evaluation:

  • Step 1 (Semantic): "guaranteed" was added, a meaning not present in the source. Possible legal implications.
  • Step 2 (Terminology): No applicable glossary entries.
  • Step 3 (Format): Format intact.
  • Score: 3 - Meaning was added. Review required.

Example 4: Score 1 (Severe Error)

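Source: 데이터 삭제 후 복구할 수 없습니다.
Translation: You can recover your data after deletion.
Backtranslation: 삭제 후 데이터를 복구할 수 있습니다.

Evaluation:

  • Step 1 (Semantic): Meaning completely reversed! "cannot be recovered" → "can be recovered". Serious mistranslation.
  • Step 2 (Terminology): Not applicable.
  • Step 3 (Format): Not applicable.
  • Score: 1 - Meaning reversed. Risk of user misunderstanding and data loss.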

Output Format

<output_format> Return evaluation results in the following JSON structure:

{{
  "reasoning_chain": [
    "Step 1 (Semantic): [detailed semantic analysis]",
    "Step 2 (Terminology): [detailed terminology verification]",
    "Step 3 (Format): [detailed format verification]"
  ],
  "score": 4,
  "verdict": "pass",
  "issues": [
    "Issue found 1",
    "Issue found 2"
  ],
  "corrections": [
    {{
      "original": "current sentence/word",
      "suggested": "suggested correction",
      "reason": "reason for the correction"
    }}
  ]
}}

Verdict Mapping:

  • Score 5-4: "pass"
  • Score 3: "review"
  • Score 0-2: "fail" </output_format>
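A consumer of this output might check that the score and verdict agree before trusting a result. The sketch below assumes the JSON has already been parsed into a Python dictionary; `verdict_for` and `validate_result` are hypothetical helper names, not part of the skill.

```python
REQUIRED_KEYS = ("reasoning_chain", "score", "verdict", "issues", "corrections")


def verdict_for(score: int) -> str:
    """Map a 0-5 score to its verdict: 5-4 pass, 3 review, 0-2 fail."""
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    if score >= 4:
        return "pass"
    if score == 3:
        return "review"
    return "fail"


def validate_result(result: dict) -> list[str]:
    """Return a list of problems found in a parsed evaluation result."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in result]
    if problems:
        return problems
    expected = verdict_for(result["score"])
    if result["verdict"] != expected:
        problems.append(
            f"verdict {result['verdict']!r} does not match score {result['score']} "
            f"(expected {expected!r})"
        )
    return problems


if __name__ == "__main__":
    sample = {
        "reasoning_chain": [],
        "score": 3,
        "verdict": "pass",
        "issues": [],
        "corrections": [],
    }
    print(validate_result(sample))
```

An empty list means the result is structurally complete and its verdict is consistent with its score.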

Constraints

Success Criteria

<success_criteria>

  • Evaluation is evidence-based, not opinion-based
  • Reasoning chain clearly explains the score
  • Issues are specific and actionable
  • Corrections provide clear improvement path
  • Score accurately reflects translation quality </success_criteria>

Install

Requires askill CLI v1.0+

AI Quality Score

96/100 (analyzed 2/10/2026)

An exceptionally well-structured skill for evaluating translation accuracy. It features a rigorous multi-step procedure, a weighted scoring rubric, and high-quality few-shot examples that demonstrate various failure modes.

Metadata

License: unknown
Version: -
Updated: 1/6/2026
Publisher: gonsoomoon-ml

Tags

prompting