Đơn Mù - Learn the Oxford 3000 in 120 Days

Design Document
Date: 2026-01-09
Version: 1.1
Status: Design Complete - Ready for Implementation
Author: Claude (via /brainstorm interactive session)

Constants Reference: See definitions.md for all constants and enums.


Table of Contents

  1. Architecture Overview
  2. Component Breakdown - Database Schema
  3. Service Layer Logic
  4. Data Flow & Frontend Integration
  5. Content Preparation & Data Migration
  6. Implementation Checklist

Architecture Overview

Roadmap Overview

Duration: 120 days
Daily load: 25 new words + 10 review words = 35 words/day
Total time commitment: 20-30 minutes/day
Progression: Linear across 3 phases (A1 → A2 → B1)

Scope Clarification

⚠️ IMPORTANT: This file details the implementation of RX_MU (Đơn Mù) - the prescription that puts 70% of its weight on vocabulary, with 25 new words/day.

Other prescriptions have different daily loads:

  • RX_DIEC, RX_CAM, RX_PHAN_XA: 10 new words/day
  • RX_FOUNDATION: 15 new words/day

See definitions.md Sections 1.2 and 8 for each prescription's weighting and constants.

Implementation Note: DailyLessonService must check the user's prescription_type and adjust new_word_count accordingly.

Phase Breakdown

Phase   | Days   | Level | Words                    | Focus
--------|--------|-------|--------------------------|--------------------------------------
Phase 1 | 1-40   | A1    | 1,000 basic words        | Everyday vocabulary, survival English
Phase 2 | 41-80  | A2    | 1,000 intermediate words | Broader topics, simple conversations
Phase 3 | 81-120 | B1    | 1,000 advanced words     | Abstract concepts, opinions

Checkpoint Milestones

  • Day 40: Test the 1,000 Phase 1 words (Pass: 70% = 700 words)
  • Day 80: Test the 1,000 Phase 2 words (Pass: 70% = 700 words)
  • Day 120: Final test of 500 words (sampled) (Pass: 70%)

Content Format

Each vocabulary entry includes:

  • Flashcard: Front (English word) / Back (Vietnamese + image)
  • Audio pronunciation: Azure Speech TTS
  • Visual aid: Unsplash API (for concrete words)
  • Example sentence: English + Vietnamese (optional for MVP)

Spaced Repetition

  • Algorithm: SM-2 (SuperMemo 2) - Industry standard
  • Intervals: 1 day → 6 days → ~15 days → ~38 days → exponential growth (max: 180 days)
  • Quality scores: 0-5 (0-2: Fail, 3: Hard, 4-5: Good/Easy)

Component Breakdown - Database Schema

Core Tables

1. words Table - Oxford 3000 Vocabulary

words:
  - id (PK)
  - word (string, unique)
  - vietnamese_meaning (text)
  - part_of_speech (enum: noun, verb, adj, adv, etc.)
  - difficulty_level (enum: A1, A2, B1)  -- phase derived: A1=1, A2=2, B1=3
  - day_introduced (int: 1-120)
  - is_concrete (boolean, default: true)  -- NEW: for image decision
  - pronunciation_difficulty (smallint: 1-3, default: 1)  -- NEW: AI coach priority
  - example_sentence_en (text, nullable)
  - example_sentence_vi (text, nullable)
  - image_url (string, nullable)
  - audio_url (string)
  - frequency_rank (int)
  - created_at, updated_at

Key fields explained:

  • difficulty_level: Source of truth for CEFR level (A1/A2/B1)
  • is_concrete: Determines if word needs image (concrete nouns = true)
  • pronunciation_difficulty: Heuristic score for Vietnamese learners (1=easy, 3=hard)
  • frequency_rank: Oxford frequency ranking (rank 1 = most common)

2. user_word_progress Table - SM-2 Tracking

user_word_progress:
  - id (PK)
  - user_id (FK → users)
  - word_id (FK → words)
  - status (enum: new, learning, mastered, forgotten)  -- Keep for readability
  - ease_factor (float, default: 2.5)  -- SM-2 parameter
  - repetition_count (int, default: 0)
  - interval_days (int, default: 1)
  - next_review_date (date)
  - last_reviewed_at (timestamp, nullable)
  - last_quality_score (smallint: 0-5, nullable)  -- NEW: SM-2 quality rating
  - correct_count (int, default: 0)
  - incorrect_count (int, default: 0)
  - created_at, updated_at
  - UNIQUE(user_id, word_id)

SM-2 fields:

  • ease_factor: Determines interval growth rate (1.3-2.5+)
  • repetition_count: Number of successful reviews
  • interval_days: Days until next review (capped at 180)
  • last_quality_score: User's last self-rating (0-5)

3. daily_lessons Table - Pre-structured Lessons

daily_lessons:
  - id (PK)
  - day_number (int: 1-120, unique)
  - level (enum: A1, A2, B1)  -- Removed phase column
  - new_word_ids (jsonb)  -- Array of 25 word IDs; lower-load prescriptions take a prefix
  - estimated_duration_minutes (int, default: 25)
  - created_at, updated_at

Design note:

  • new_word_ids is a plan, not used for spaced repetition logic
  • Review words queried from user_word_progress.next_review_date
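The design note above can be made concrete with a minimal pure-Python sketch (rows and dates are illustrative; field names mirror user_word_progress): the review queue is derived from next_review_date, never from daily_lessons.

```python
from datetime import date

# Illustrative user_word_progress rows
progress = [
    {'word_id': 1, 'next_review_date': date(2026, 1, 8)},   # overdue
    {'word_id': 2, 'next_review_date': date(2026, 1, 9)},   # due today
    {'word_id': 3, 'next_review_date': date(2026, 1, 20)},  # not yet due
]

today = date(2026, 1, 9)

# Due = next_review_date <= today, ordered so overdue words come first
due = sorted((p for p in progress if p['next_review_date'] <= today),
             key=lambda p: p['next_review_date'])
print([p['word_id'] for p in due])  # [1, 2]
```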

4. user_daily_progress Table - Completion Tracking

user_daily_progress:
  - id (PK)
  - user_id (FK → users)
  - daily_lesson_id (FK → daily_lessons)
  - status (enum: not_started, in_progress, completed)
  - words_learned_count (int)
  - words_reviewed_count (int)
  - accuracy_rate (float)  -- % correct answers
  - time_spent_minutes (int)
  - completed_at (timestamp, nullable)
  - created_at, updated_at
  - UNIQUE(user_id, daily_lesson_id)

5. checkpoint_tests Table - Milestone Tests

checkpoint_tests:
  - id (PK)
  - day_number (int: 40, 80, 120)
  - level (enum: A1, A2, B1)  -- Removed phase column
  - test_type (enum: recognition, recall, mixed, default: mixed)  -- NEW: optional
  - total_words_tested (int: 1000 for days 40/80, 500 sampled for day 120)
  - pass_threshold (float: 0.70)
  - created_at, updated_at

Test types:

  • recognition: Multiple choice (show English → choose Vietnamese)
  • recall: Type answer (show Vietnamese → type English)
  • mixed: 50/50 combination
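A recognition question might be assembled as sketched below. Field names mirror the words table; build_recognition_question and its distractor sampling are assumptions for illustration, not the final question generator.

```python
import random

def build_recognition_question(word: dict, distractor_pool: list) -> dict:
    """Show English word, choose among 4 Vietnamese meanings."""
    # Draw 3 wrong Vietnamese meanings, excluding the target word itself
    distractors = [d['vietnamese_meaning'] for d in distractor_pool
                   if d['id'] != word['id']]
    options = random.sample(distractors, 3) + [word['vietnamese_meaning']]
    random.shuffle(options)
    return {
        'question_id': f"q{word['id']}",
        'prompt': word['word'],                       # English shown
        'options': options,                           # Vietnamese choices
        'correct_answer': word['vietnamese_meaning'],
    }

apple = {'id': 123, 'word': 'apple', 'vietnamese_meaning': 'quả táo'}
pool = [{'id': i, 'vietnamese_meaning': f'nghĩa {i}'} for i in range(1, 6)]
q = build_recognition_question(apple, pool)
print(q['correct_answer'] in q['options'])  # True
```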

6. user_checkpoint_results Table

user_checkpoint_results:
  - id (PK)
  - user_id (FK → users)
  - checkpoint_test_id (FK → checkpoint_tests)
  - total_questions (int)
  - correct_answers (int)
  - score_percentage (float)
  - passed (boolean)
  - completed_at (timestamp)
  - created_at, updated_at

Service Layer Logic

1. VocabularyService

Responsibility: Load and organize the 3,000 Oxford words by phase

class VocabularyService:

    async def get_words_for_day(self, day_number: int) -> List[Word]:
        """Get 25 new words for specific day"""

        # Guard: Day number validation
        if day_number < 1 or day_number > 120:
            raise InvalidDayNumberError(f"Day must be 1-120, got {day_number}")

        words = await word_repo.get_by_day(day_number)

        # Guard: Data integrity check
        if len(words) != 25:
            logger.error(f"Lesson plan corrupted: day {day_number} has {len(words)} words")
            raise LessonPlanCorruptedError(
                f"Expected 25 words for day {day_number}, found {len(words)}"
            )

        return words

    async def get_word_with_assets(self, word_id: int) -> WordDetail:
        """Get word + audio + image URLs"""
        # Returns word with pre-signed URLs for Azure TTS audio and Unsplash image

    async def search_words(
        self,
        level: Optional[str] = None,
        is_concrete: Optional[bool] = None,
        pronunciation_difficulty: Optional[int] = None
    ) -> List[Word]:
        """Filter words by criteria"""

2. SpacedRepetitionService - SM-2 Algorithm

Responsibility: Compute review intervals and scheduling

class SpacedRepetitionService:

    MAX_INTERVAL_DAYS = 180  # Cap at ~6 months

    async def calculate_next_review(
        self,
        user_word: UserWordProgress,
        quality_score: int  # 0-5 from user feedback
    ) -> UserWordProgress:
        """
        SM-2 Algorithm Implementation

        Quality scores:
        - 0-2: Fail (Chưa nhớ / Sai)
        - 3: Hard (Nhớ mơ hồ)
        - 4-5: Good/Easy (Nhớ rõ)
        """

        # Guard: Validate quality score
        if not 0 <= quality_score <= 5:
            raise ValueError(f"Quality score must be 0-5, got {quality_score}")

        if quality_score < 3:
            # Failed - reset
            user_word.interval_days = 1
            user_word.repetition_count = 0
            user_word.status = 'learning'
        else:
            # Passed
            if user_word.repetition_count == 0:
                user_word.interval_days = 1
            elif user_word.repetition_count == 1:
                user_word.interval_days = 6
            else:
                user_word.interval_days = round(
                    user_word.interval_days * user_word.ease_factor
                )

            # Guard: Cap interval to prevent runaway
            user_word.interval_days = min(user_word.interval_days, self.MAX_INTERVAL_DAYS)

            user_word.repetition_count += 1

            # Update ease factor
            user_word.ease_factor = max(1.3,
                user_word.ease_factor + (0.1 - (5 - quality_score) * (0.08 + (5 - quality_score) * 0.02))
            )

            # Update status
            if user_word.repetition_count >= 3 and user_word.interval_days >= 14:
                user_word.status = 'mastered'

        # Set next review date
        user_word.next_review_date = date.today() + timedelta(days=user_word.interval_days)
        user_word.last_reviewed_at = datetime.now()
        user_word.last_quality_score = quality_score

        return user_word

    async def get_due_reviews(
        self,
        user_id: int,
        limit: int = 10
    ) -> List[UserWordProgress]:
        """
        Get words due for review today, prioritize overdue

        Guard: Order by overdue first to handle backlog gracefully
        """

        due_words = await user_word_repo.get_due_for_review(
            user_id=user_id,
            as_of_date=date.today(),
            order_by='next_review_date ASC',  # Overdue first
            limit=limit
        )

        return due_words

SM-2 Intervals Visualization:

Repetition 0 → 1 day
Repetition 1 → 6 days
Repetition 2 → ~15 days (6 * 2.5 ease_factor)
Repetition 3 → ~38 days
Repetition 4+ → exponential growth (capped at 180 days)
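The progression above can be verified with a standalone sketch, assuming a constant ease factor of 2.5 and quality ≥ 3 on every review (the happy path; real intervals diverge as ease_factor changes):

```python
def sm2_intervals(reviews: int, ease_factor: float = 2.5, cap: int = 180) -> list:
    """Interval after each successful review, mirroring calculate_next_review."""
    intervals = []
    interval = 0
    for rep in range(reviews):
        if rep == 0:
            interval = 1
        elif rep == 1:
            interval = 6
        else:
            # Guard: cap prevents runaway growth
            interval = min(round(interval * ease_factor), cap)
        intervals.append(interval)
    return intervals

print(sm2_intervals(6))  # [1, 6, 15, 38, 95, 180]
```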

3. DailyLessonService

Responsibility: Build the daily lesson (new words + 10 reviews)

class DailyLessonService:

    def _get_new_word_count(self, prescription_type: str) -> int:
        """
        Return daily new word count based on prescription
        Reference: definitions.md Section 8
        """
        WORD_COUNTS = {
            'don_mu': 25,
            'don_diec': 10,
            'don_cam': 10,
            'don_yeu_phan_xa': 10,
            'don_foundation': 15,
        }
        return WORD_COUNTS.get(prescription_type, 25)  # Default to 25

    async def get_lesson_for_user(
        self,
        user_id: int,
        day_number: int
    ) -> DailyLesson:
        """
        Generate daily lesson with deduplication
        
        Word count varies by prescription:
        - RX_MU: 25 new words + 10 review
        - RX_DIEC/CAM/PHAN_XA: 10 new words + 10 review
        - RX_FOUNDATION: 15 new words + 10 review
        """

        # 0. Get user's prescription to determine word count
        user_prescription = await diagnosis_repo.get_latest_prescription(user_id)
        new_word_count = self._get_new_word_count(user_prescription.prescription_type)

        # 1. Get pre-defined lesson plan
        lesson_plan = await daily_lesson_repo.get_by_day(day_number)

        # 2. Get new words (limited by prescription type)
        new_words = await vocab_service.get_words_by_ids(
            lesson_plan.new_word_ids[:new_word_count]
        )
        new_word_ids_set = set(lesson_plan.new_word_ids[:new_word_count])

        # 3. Get up to 10 review words (due today or overdue)
        review_candidates = await spaced_repetition_service.get_due_reviews(
            user_id=user_id,
            limit=15  # Get extra to account for dedup
        )

        # Guard: Deduplicate - exclude words in new_words
        review_words = [
            word for word in review_candidates
            if word.word_id not in new_word_ids_set
        ][:10]  # Take first 10 after dedup

        # 4. Combine and return
        return DailyLesson(
            day_number=day_number,
            new_words=new_words,
            review_words=review_words,
            total_words=len(new_words) + len(review_words),
            estimated_duration=lesson_plan.estimated_duration_minutes
        )

    async def record_word_interaction(
        self,
        user_id: int,
        word_id: int,
        quality_score: int,
        is_new_word: bool
    ):
        """Record user interaction with idempotency check"""

        user_word = await user_word_repo.get_or_create(user_id, word_id)

        # Guard: Prevent double updates on same day
        if user_word.last_reviewed_at:
            last_review_date = user_word.last_reviewed_at.date()
            if last_review_date == date.today():
                logger.warning(
                    f"Word {word_id} already reviewed today by user {user_id}, skipping"
                )
                # Return existing state (idempotent response)
                return {
                    "success": True,
                    "skipped": True,
                    "message": "Từ này đã được ghi nhận hôm nay",
                    "next_review_date": user_word.next_review_date,
                    "status": user_word.status
                }

        if is_new_word:
            user_word.status = 'learning'
            user_word.repetition_count = 0

        # Update counts
        if quality_score >= 3:
            user_word.correct_count += 1
        else:
            user_word.incorrect_count += 1

        # Calculate next review using SM-2
        user_word = await spaced_repetition_service.calculate_next_review(
            user_word, quality_score
        )

        await user_word_repo.update(user_word)

        return {
            "success": True,
            "skipped": False,
            "next_review_date": user_word.next_review_date,
            "interval_days": user_word.interval_days,
            "status": user_word.status
        }
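The idempotency guard above compares calendar dates, not timestamps: a review at 08:30 and a double click at 23:59 the same day both count as "already reviewed today". A minimal illustration (already_reviewed_today is a hypothetical helper, not part of the service):

```python
from datetime import datetime, date

def already_reviewed_today(last_reviewed_at, today=None):
    """True if the word was already recorded on the given calendar day."""
    if last_reviewed_at is None:
        return False  # never reviewed → first interaction proceeds
    return last_reviewed_at.date() == (today or date.today())

today = date(2026, 1, 9)
print(already_reviewed_today(datetime(2026, 1, 9, 8, 30), today))   # True
print(already_reviewed_today(datetime(2026, 1, 8, 23, 59), today))  # False
print(already_reviewed_today(None, today))                          # False
```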

4. CheckpointTestService

Responsibility: Generate and grade tests on days 40, 80, and 120

class CheckpointTestService:

    async def generate_test(
        self,
        checkpoint_day: int,
        user_id: int
    ) -> CheckpointTest:
        """
        Generate checkpoint test with smart sampling

        Day 40: Test 1,000 words from A1 (days 1-40)
        Day 80: Test 1,000 words from A2 (days 41-80)
        Day 120: Sample 500 words from ALL 3,000 (weighted)
        """

        checkpoint_config = await checkpoint_repo.get_by_day(checkpoint_day)

        if checkpoint_day == 120:
            # Guard: Don't test all 3,000 - sample strategically
            word_pool = await self._get_weighted_sample(
                user_id=user_id,
                target_count=500,
                weights={
                    'failed_before': 0.60,  # 300 words
                    'learning': 0.30,        # 150 words
                    'mastered': 0.10         # 50 words
                }
            )
        else:
            # Phase-specific test (test all 1,000)
            level = 'A1' if checkpoint_day == 40 else 'A2'
            word_pool = await vocab_service.get_words_by_level([level])

        test_questions = self._generate_questions(
            words=word_pool,
            test_type=checkpoint_config.test_type,
            count=len(word_pool)
        )

        return CheckpointTest(
            day_number=checkpoint_day,
            questions=test_questions,
            pass_threshold=checkpoint_config.pass_threshold
        )

    async def _get_weighted_sample(
        self,
        user_id: int,
        target_count: int,
        weights: Dict[str, float]
    ) -> List[Word]:
        """Sample words based on user's learning history"""

        failed_words = await user_word_repo.get_by_criteria(
            user_id=user_id,
            incorrect_count__gt=2
        )

        learning_words = await user_word_repo.get_by_status(
            user_id=user_id,
            status='learning'
        )

        mastered_words = await user_word_repo.get_by_status(
            user_id=user_id,
            status='mastered'
        )

        # Sample according to weights
        sample = []
        sample.extend(random.sample(failed_words, min(len(failed_words), int(target_count * weights['failed_before']))))
        sample.extend(random.sample(learning_words, min(len(learning_words), int(target_count * weights['learning']))))
        sample.extend(random.sample(mastered_words, min(len(mastered_words), int(target_count * weights['mastered']))))

        return sample
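Because any bucket may hold fewer words than its quota (e.g. a strong learner with few failed words), the sampling above can under-fill target_count. A standalone variant with a backfill step, assuming the 60/30/10 split from generate_test (bucket contents are illustrative integers):

```python
import random

def weighted_sample(buckets: dict, weights: dict, target: int) -> list:
    chosen, remainder = [], []
    for name, pool in buckets.items():
        quota = int(target * weights[name])
        picked = random.sample(pool, min(len(pool), quota))
        chosen.extend(picked)
        remainder.extend(w for w in pool if w not in picked)
    # Backfill from leftover words when a bucket could not meet its quota
    shortfall = target - len(chosen)
    if shortfall > 0:
        chosen.extend(random.sample(remainder, min(shortfall, len(remainder))))
    return chosen

buckets = {'failed_before': list(range(100)),        # only 100 failed words
           'learning': list(range(100, 400)),
           'mastered': list(range(400, 600))}
weights = {'failed_before': 0.60, 'learning': 0.30, 'mastered': 0.10}
sample = weighted_sample(buckets, weights, 500)
print(len(sample))  # 500 — the failed_before shortfall was backfilled
```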

    async def grade_test(
        self,
        user_id: int,
        checkpoint_test_id: int,
        answers: List[Answer]
    ) -> CheckpointResult:
        """Grade test and determine pass/fail"""

        correct = sum(1 for ans in answers if ans.is_correct)
        total = len(answers)

        # Guard: Empty submission would divide by zero
        if total == 0:
            raise ValueError("Cannot grade a test with no answers")

        score = correct / total
        passed = score >= 0.70  # 70% threshold

        result = UserCheckpointResult(
            user_id=user_id,
            checkpoint_test_id=checkpoint_test_id,
            total_questions=total,
            correct_answers=correct,
            score_percentage=score * 100,
            passed=passed
        )

        await checkpoint_result_repo.create(result)

        return result

5. ProgressTrackingService

Responsibility: Track overall progress

class ProgressTrackingService:

    async def get_user_stats(self, user_id: int) -> UserStats:
        """Get comprehensive statistics"""

        total_learned = await user_word_repo.count_by_status(
            user_id, ['learning', 'mastered']
        )

        mastered = await user_word_repo.count_by_status(
            user_id, ['mastered']
        )

        current_streak = await self._calculate_streak(user_id)

        return UserStats(
            words_learned=total_learned,
            words_mastered=mastered,
            completion_percentage=(total_learned / 3000) * 100,  # Marketing metric
            true_mastery_percentage=(mastered / max(total_learned, 1)) * 100,  # Real metric
            current_day=current_streak.current_day,
            streak_days=current_streak.consecutive_days
        )

Edge Cases Handled

Service                 | Edge Case                    | Solution
------------------------|------------------------------|------------------------------------------
VocabularyService       | Day > 120 or < 1             | Fail fast with InvalidDayNumberError
VocabularyService       | Missing 25 words             | Throw LessonPlanCorruptedError
SpacedRepetitionService | interval_days runaway        | Cap at 180 days
DailyLessonService      | User away 5-7 days → backlog | Limit to 10 reviews, overdue first
DailyLessonService      | Word in both new + review    | Deduplicate (exclude from review)
DailyLessonService      | Double click / refresh       | Idempotent (check last_reviewed_at.date)
CheckpointTestService   | Day 120 → 3,000 words        | Sample 500 weighted by learning status

Data Flow & Frontend Integration

API Endpoints

Base path: /api/v1/vocabulary

Daily Lesson Endpoints

Get daily lesson:

GET /api/v1/vocabulary/daily-lesson/{day_number}

Response (200 OK):
{
  "day_number": 15,
  "level": "A1",  // Source: daily_lessons.level column (pre-defined in DB)
                  // Frontend: NEVER derive from day_number yourself
  "new_words": [
    {
      "id": 123,
      "word": "apple",
      "vietnamese_meaning": "quả táo",
      "part_of_speech": "noun",
      "is_concrete": true,
      "pronunciation_difficulty": 1,
      "audio_url": "https://azure.blob/apple_noun.mp3",
      "image_url": "https://unsplash.com/apple-xyz",
      "example_sentence_en": "I eat an apple every day",
      "example_sentence_vi": "Tôi ăn một quả táo mỗi ngày"
    }
    // ... 24 more
  ],
  "review_words": [
    {
      "id": 45,
      "word": "hello",
      // ... same structure
      "user_progress": {
        "repetition_count": 2,
        "interval_days": 6,
        "last_quality_score": 4,
        "status": "learning"
      }
    }
    // ... up to 9 more
  ],
  "total_words": 35,
  "estimated_duration_minutes": 25
}

Submit word interaction:

POST /api/v1/vocabulary/word-interaction

Request:
{
  "word_id": 123,
  "quality_score": 5,  // 0-5
  "is_new_word": true,
  "time_spent_seconds": 15
}

Response Case 1 - First interaction (200 OK):
{
  "success": true,
  "next_review_date": "2026-01-20",
  "interval_days": 1,
  "status": "learning",
  "message": "Tuyệt vời! Từ này sẽ xuất hiện lại vào ngày 20/01"
}

Response Case 2 - Already reviewed today (200 OK):
{
  "success": true,
  "skipped": true,  // Idempotent flag
  "message": "Từ này đã được ghi nhận hôm nay",
  "next_review_date": "2026-01-20",
  "status": "learning"
}

Complete daily lesson:

POST /api/v1/vocabulary/daily-lesson/{day_number}/complete

Request:
{
  "words_learned": 25,
  "words_reviewed": 10,
  "accuracy_rate": 0.85,
  "time_spent_minutes": 28
}

Response (200 OK):
{
  "completed": true,
  "streak_days": 15,
  "next_day_unlocked": 16,
  "encouragement_message": "Tuyệt vời! Bạn đã hoàn thành ngày 15. Hẹn gặp bạn ngày mai!"
}

Progress Endpoints

GET /api/v1/vocabulary/stats

Response (200 OK):
{
  "words_learned": 375,
  "words_mastered": 120,
  "current_day": 15,
  "streak_days": 15,
  "completion_percentage": 12.5,
  "true_mastery_percentage": 32.0,
  "next_checkpoint": {
    "day": 40,
    "words_remaining": 625
  }
}

Checkpoint Test Endpoints

Start test:

POST /api/v1/vocabulary/checkpoint/{day_number}/start

Response (200 OK):
{
  "test_id": "ckpt_40_user123_xyz",
  "day_number": 40,
  "total_questions": 1000,
  "page_size": 50,  // Paginated to prevent frontend lag
  "total_pages": 20,
  "test_type": "mixed",
  "pass_threshold": 0.70,
  "questions": [
    // First 50 questions
  ],
  "pagination": {
    "current_page": 1,
    "has_next": true,
    "next_page_url": "/api/v1/vocabulary/checkpoint/40/questions?page=2&test_id=..."
  }
}

Submit test:

POST /api/v1/vocabulary/checkpoint/{day_number}/submit

Request:
{
  "test_id": "ckpt_40_user123_xyz",
  "answers": [
    { "question_id": "q1", "answer": "quả táo" }
    // ... all answers
  ]
}

Response (200 OK):
{
  "total_questions": 1000,
  "correct_answers": 750,
  "score_percentage": 75.0,
  "passed": true,
  "pass_threshold": 70.0,
  "message": "Chúc mừng! Bạn đã vượt qua checkpoint A1 với 75%",
  "next_phase_unlocked": "A2",
  "weak_words": [
    // Words answered incorrectly
  ]
}

Frontend State Management

LessonStore (with interaction locks):

interface LessonState {
  currentDay: number;
  newWords: Word[];
  reviewWords: Word[];
  currentWordIndex: number;

  // Interaction locks
  submittedWords: Set<number>;
  isSubmitting: boolean;
  lastSubmitError: Error | null;

  // Actions
  loadLesson: (dayNumber: number) => Promise<void>;
  submitWordInteraction: (wordId: number, qualityScore: number) => Promise<void>;
  canSubmitWord: (wordId: number) => boolean;
  nextWord: () => void;
  completeLesson: () => Promise<void>;
}

ProgressStore (single source of truth):

interface ProgressState {
  stats: UserStats;
  wordsLearned: Map<number, UserWordProgress>;
  currentStreak: number;

  // Actions
  fetchStats: () => Promise<void>;
  updateWordProgress: (wordId: number, backendResponse: any) => void;

  // DON'T: self-increment counters (causes drift from backend)
}

Frontend Developer Rules

/**
 * RULE 1: Never derive level from day_number
 * ✅ GOOD: const level = lessonData.level;  // Trust backend
 * ❌ BAD:  const level = dayNumber <= 40 ? 'A1' : ...;
 */

/**
 * RULE 2: Handle duplicate interactions gracefully
 * Check response.skipped flag, not error codes
 */
if (response.skipped) {
  disableRatingUI(wordId);
  toast.info(response.message);  // Gentle, not error
}

/**
 * RULE 3: Pagination for checkpoints
 * Load first 5 pages (250 questions) initially
 */

/**
 * RULE 4: Interaction locks
 * Always check canSubmitWord() before allowing user action
 */

/**
 * RULE 5: Single source of truth
 * Always update state from backend responses only
 */

Content Preparation & Data Migration

Oxford 3000 Word List Sourcing

Primary source: Oxford Learner's Dictionaries API

# ASSUMPTION CLARIFICATION:
# Oxford API endpoint /wordlist/en-gb/oxford3000
# returns COMPLETE list in ONE request (not 3000 individual requests)

import requests  # sync client is acceptable for this one-off seeding script

async def fetch_oxford_3000():
    """
    Fetch Oxford 3000 word list

    Free tier: 1,000 requests/month
    Strategy: Fetch wordlist once, enrich top 500 words only
    """

    endpoint = "https://od-api.oxforddictionaries.com/api/v2/wordlist/en-gb/oxford3000"

    headers = {
        'app_id': OXFORD_APP_ID,
        'app_key': OXFORD_API_KEY
    }

    response = requests.get(endpoint, headers=headers)
    words = response.json()['results']

    logger.info(f"Fetched {len(words)} words in single request")
    return words

Expected distribution:

Oxford 3000 breakdown:
- A1: ~500-700 words
- A2: ~800-1000 words
- B1: ~1200-1500 words
- B2+: ~800 words (exclude)

Target: Balance to exactly 1,000 per level (A1, A2, B1)

Daily Distribution Strategy

Assigning day_introduced (1-120):

async def assign_day_introduced(categorized: Dict) -> List[Dict]:
    """
    Assign day_introduced based on:
    1. Frequency (most common first)
    2. Concreteness (concrete nouns before abstract)
    3. Part of speech (nouns → verbs → adjectives)
    """

    all_words = []
    day_counter = 1

    # Phase 1: A1 words (days 1-40)
    a1_words = categorized['A1']
    a1_words.sort(key=lambda w: (
        -w.get('frequency', 0),           # Higher frequency first
        w['part_of_speech'] != 'noun',    # Nouns first
        not is_concrete(w)                 # Concrete first
    ))

    for i, word in enumerate(a1_words):
        word['day_introduced'] = day_counter + (i // 25)
        word['difficulty_level'] = 'A1'
        all_words.append(word)

    # ... repeat for A2, B1

    return all_words
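A quick check of the `day_counter + (i // 25)` mapping above: 1,000 words at 25 per day fill exactly days 1-40, 25 words each.

```python
from collections import Counter

# day_introduced for A1 word index i (day_counter starts at 1)
days = [1 + i // 25 for i in range(1000)]

print(days[0], days[24], days[25], days[999])        # 1 1 2 40
print(all(c == 25 for c in Counter(days).values()))  # True
print(len(set(days)))                                # 40
```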

Audio Generation (Azure TTS)

Strategy: Pre-generate all 3,000 audio files

async def generate_audio_for_word(word: str, output_path: str):
    """
    Generate audio using Azure TTS

    Voice: en-US-JennyNeural (female, clear)
    Format: MP3, 48kHz
    """

    speech_config = SpeechConfig(
        subscription=AZURE_SPEECH_KEY,
        region=AZURE_REGION
    )

    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # SSML for better pronunciation
    ssml = f"""
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
        <voice name="en-US-JennyNeural">
            <prosody rate="0.9">
                {word}
            </prosody>
        </voice>
    </speak>
    """

    # ... synthesize and upload to Azure Blob Storage

File naming (handles homonyms):

def generate_audio_filename(word: Dict) -> str:
    """
    Format: {word}_{pos}.mp3

    Examples:
    - run_verb.mp3
    - run_noun.mp3
    - book_noun.mp3

    Prevents collision for words with multiple parts of speech
    """
    word_text = word['word'].lower().replace(' ', '_')
    pos = word['part_of_speech'][:4]

    return f"{word_text}_{pos}.mp3"

Cost:

Azure TTS Neural Voice: $16 per 1M characters
3,000 words × avg 8 characters = 24,000 characters (0.024M characters)
Cost: $16 × 0.024 = $0.38 (one-time)

✅ Extremely affordable for pre-generation

Image Sourcing (Unsplash API)

Strategy: Fetch only for concrete words (~1,500 images)

async def fetch_image_for_word(word: str) -> str | None:
    """
    Fetch image from Unsplash API
    Only for concrete words (is_concrete = true)
    """

    endpoint = "https://api.unsplash.com/search/photos"
    headers = {'Authorization': f'Client-ID {UNSPLASH_ACCESS_KEY}'}  # Unsplash auth
    params = {
        'query': word,
        'per_page': 1,
        'orientation': 'squarish',
        'content_filter': 'high'
    }

    response = requests.get(endpoint, params=params, headers=headers)

    if response.status_code == 200:
        results = response.json()['results']
        if results:
            return results[0]['urls']['regular']

    return None

Rate limits:

Demo (Free) tier: 50 requests/hour
~1,500 concrete words ÷ 50 = 30 hours to fetch all

Production tier: 5,000 requests/hour
Can fetch all in <1 hour

✅ Use demo tier for MVP, upgrade if needed

Pronunciation Difficulty Tagging

Heuristic-based (not phonetic science):

def calculate_pronunciation_difficulty(word: Dict) -> int:
    """
    HEURISTIC score (1-3) for Vietnamese learners

    ⚠️ IMPORTANT: This is NOT phonetic analysis
    - Based on common Vietnamese learner patterns
    - Use as guidance, not absolute truth

    Difficulty factors:
    - /θ/ /ð/ (th): not in Vietnamese → 3
    - /v/ /w/: often confused → 2
    - Consonant clusters: difficult → 3
    """

    word_text = word['word'].lower()
    difficulty = 1

    if 'th' in word_text:
        difficulty = max(difficulty, 3)

    if word_text.startswith('v') or word_text.startswith('w'):
        difficulty = max(difficulty, 2)

    # ... more heuristics

    return difficulty
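A runnable version of the heuristic, with the consonant-cluster rule from the docstring filled in (the regex-based cluster check is a hypothetical extension, not the final rule set):

```python
import re

def pronunciation_difficulty(word_text: str) -> int:
    """Heuristic 1-3 score for Vietnamese learners; NOT phonetic analysis."""
    word_text = word_text.lower()
    difficulty = 1

    if 'th' in word_text:                      # /θ/ /ð/ absent in Vietnamese
        difficulty = max(difficulty, 3)

    if word_text.startswith(('v', 'w')):       # /v/ vs /w/ confusion
        difficulty = max(difficulty, 2)

    # Hypothetical: 3+ consecutive consonant letters approximate a cluster
    if re.search(r'[bcdfghjklmnpqrstvwxz]{3}', word_text):
        difficulty = max(difficulty, 3)

    return difficulty

print(pronunciation_difficulty('think'))   # 3 ('th')
print(pronunciation_difficulty('water'))   # 2 (starts with 'w')
print(pronunciation_difficulty('strong'))  # 3 ('str' cluster)
print(pronunciation_difficulty('cat'))     # 1
```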

Database Migration Script

async def seed_words_table():
    """
    One-time migration to populate words table

    Workflow:
    1. Fetch Oxford 3000 from API (1 request)
    2. Filter to A1-B1 (3,000 words)
    3. Assign day_introduced (1-120)
    4. Generate audio files (Azure TTS)
    5. Fetch images (Unsplash)
    6. Calculate pronunciation_difficulty
    7. Insert into database
    """

    logger.info("Step 1: Fetching Oxford 3000...")
    raw_words = await fetch_oxford_3000()

    logger.info("Step 2: Filtering to A1-B1...")
    categorized = await filter_by_cefr(raw_words)
    balanced = await balance_to_1000_per_level(categorized)

    logger.info("Step 3: Assigning day_introduced...")
    words_with_days = await assign_day_introduced(balanced)

    logger.info("Step 4: Generating audio files...")
    words_with_audio = await batch_generate_audio(words_with_days)

    logger.info("Step 5: Fetching images...")
    words_with_images = await batch_fetch_images(words_with_audio)

    logger.info("Step 6: Calculating pronunciation difficulty...")
    for word in words_with_images:
        word['pronunciation_difficulty'] = calculate_pronunciation_difficulty(word)
        word['is_concrete'] = is_concrete(word)

    logger.info("Step 7: Inserting into database...")
    await word_repository.bulk_insert(words_with_images)

    logger.info("✅ Migration complete! 3,000 words seeded.")

Data Validation

async def validate_seeded_data():
    """Validate data integrity after migration"""

    # Check 1: Total count
    total = await word_repository.count()
    assert total == 3000, f"Expected 3000 words, found {total}"

    # Check 2: Distribution per level
    a1_count = await word_repository.count_by_level('A1')
    a2_count = await word_repository.count_by_level('A2')
    b1_count = await word_repository.count_by_level('B1')

    assert a1_count == 1000
    assert a2_count == 1000
    assert b1_count == 1000

    # Check 3: Each day has exactly 25 words
    for day in range(1, 121):
        day_words = await word_repository.get_by_day(day)
        assert len(day_words) == 25

    # Check 4: Audio URLs present
    missing_audio = await word_repository.count_where(audio_url=None)
    if missing_audio > 0:
        logger.warning(f"{missing_audio} words missing audio URLs")

    # Check 5: No duplicates
    duplicates = await word_repository.find_duplicates()
    assert len(duplicates) == 0

    logger.info("✅ All validation checks passed")

Migration Timeline & Cost

Timeline:

Week 1: Data sourcing
- Day 1-2: Obtain Oxford 3000 list
- Day 3-4: Filter, categorize, assign day_introduced
- Day 5: Manual review

Week 2: Asset generation
- Day 1-2: Generate 3,000 audio files
- Day 3-5: Fetch ~1,500 images
- Day 6-7: Tag pronunciation_difficulty

Week 3: Migration & QA
- Day 1: Run migration script
- Day 2-3: Validation & QA
- Day 4-7: Fix issues, fill gaps

Total: ~3 weeks

Cost:

One-time costs:
- Oxford API: $0-50 (depends on tier)
- Azure TTS: ~$0.40
- Unsplash API: Free
- Total: < $60

Implementation Checklist

Backend

  • Create database tables (schema Section 2)
  • Implement VocabularyService with guards
  • Implement SpacedRepetitionService (SM-2)
  • Implement DailyLessonService with deduplication
  • Implement CheckpointTestService with sampling
  • Implement ProgressTrackingService
  • Create API endpoints (Section 4)
  • Run migration script to seed 3,000 words
  • Validate seeded data
  • Write unit tests for SM-2 algorithm
  • Write integration tests for API endpoints

Frontend

  • Create LessonStore with interaction locks
  • Create ProgressStore (single source of truth)
  • Create CheckpointStore
  • Build WordCard component (flashcard)
  • Build AudioPlayer component
  • Build QualityRating component (0-5 stars)
  • Build DailyLessonPage
  • Build CheckpointTestPage (with pagination)
  • Build ProgressDashboard
  • Implement offline support (cache audio)
  • Add performance optimizations (lazy loading)
  • Write E2E tests for lesson flow

Content Preparation

  • Obtain Oxford API access
  • Fetch Oxford 3000 word list
  • Filter to A1-B1 (3,000 words)
  • Balance to 1,000 per level
  • Assign day_introduced (1-120)
  • Setup Azure TTS
  • Generate 3,000 audio files
  • Upload to Azure Blob Storage
  • Setup Unsplash API
  • Fetch images for concrete words
  • Tag is_concrete
  • Tag pronunciation_difficulty
  • Run data validation
  • Manual QA with Vietnamese teachers

Notes & Assumptions

Key Assumptions

  1. Oxford API: /wordlist/oxford3000 returns full list in one request
  2. Frequency data: Use Oxford if available, fallback to SUBTLEX-US, default to 0
  3. Audio naming: {word}_{pos}.mp3 format prevents homonym collisions
  4. Pronunciation difficulty: Heuristic score (1-3), not phonetic science

Phase 2 Enhancements (Optional)

  • Example audio URLs: Generate audio for example sentences
  • Visual priority field: Tag words by image importance (1-3)
  • Adaptive difficulty: Adjust based on user performance
  • Gamification: Badges for milestones (100 words, 7-day streak)

References & Dependencies

  • definitions.md: Single source of truth for constants, daily load, and checkpoint test rules
  • diagnosis-test.md: Prescription assignment logic (RX_MU)
  • database-schema.md: Tables words, user_word_progress, daily_lessons, checkpoint_tests
  • gamification.md: Word count milestones (100, 500, 1500, 3000)
  • ai-coach-behavior.md: Celebration messages for word milestones

Data Dependencies

  • Oxford 3000 word list (external source)
  • Azure TTS for audio generation
  • Unsplash API for images (concrete words only)

Implementation Notes

  • Checkpoint tests paginated (50 questions/page) - see Section 4 API
  • SM-2 algorithm for spaced repetition - see Section 3 Service Layer
  • Daily lesson deduplication - see DailyLessonService

End of Design Document

This design has been validated through an interactive brainstorming session and is ready for implementation.