Đơn Mù - Học 3000 Từ Oxford trong 120 Ngày
Design Document
- Date: 2026-01-09
- Version: 1.1
- Status: Design Complete - Ready for Implementation
- Author: Claude (via /brainstorm interactive session)
Constants Reference: See definitions.md for all constants and enums.
Table of Contents
- Architecture Overview
- Component Breakdown - Database Schema
- Service Layer Logic
- Data Flow & Frontend Integration
- Content Preparation & Data Migration
- Implementation Checklist
Architecture Overview
Roadmap Overview
- Duration: 120 days
- Daily load: 25 new words + 10 review words = 35 words/day
- Total time commitment: 20-30 minutes/day
- Progression: linear across 3 phases (A1 → A2 → B1)
Scope Clarification
⚠️ IMPORTANT: This file describes the detailed implementation of RX_MU (Đơn Mù) - the prescription that puts 70% of its weight on vocabulary, at 25 new words/day.
Other prescriptions have different daily loads:
- RX_DIEC, RX_CAM, RX_PHAN_XA: 10 new words/day
- RX_FOUNDATION: 15 new words/day
See definitions.md Section 1.2 and Section 8 for each prescription's weighting and constants.
Implementation Note: DailyLessonService must check the user's prescription_type and adjust new_word_count accordingly.
Phase Breakdown
| Phase | Days | Level | Words | Focus |
|---|---|---|---|---|
| Phase 1 | 1-40 | A1 | 1,000 basic words | Everyday vocabulary, survival English |
| Phase 2 | 41-80 | A2 | 1,000 intermediate words | Broader topics, simple conversations |
| Phase 3 | 81-120 | B1 | 1,000 advanced words | Abstract concepts, opinions |
Checkpoint Milestones
- Day 40: Test on 1,000 Phase 1 words (Pass: 70% = 700 words)
- Day 80: Test on 1,000 Phase 2 words (Pass: 70% = 700 words)
- Day 120: Final test on 500 sampled words (Pass: 70%)
Content Format
Each vocabulary item includes:
- Flashcard: Front (English word) / Back (Vietnamese + image)
- Audio pronunciation: Azure Speech TTS
- Visual aid: Unsplash API (for concrete words)
- Example sentence: English + Vietnamese (optional for MVP)
Spaced Repetition
- Algorithm: SM-2 (SuperMemo 2) - Industry standard
- Intervals: 1 day → 6 days → ~15 days → ~38 days, growing by the ease factor (max: 180 days)
- Quality scores: 0-5 (0-2: Fail, 3: Hard, 4-5: Good/Easy)
Component Breakdown - Database Schema
Core Tables
1. words Table - Oxford 3000 Vocabulary
words:
- id (PK)
- word (string, unique)
- vietnamese_meaning (text)
- part_of_speech (enum: noun, verb, adj, adv, etc.)
- difficulty_level (enum: A1, A2, B1) -- phase derived: A1=1, A2=2, B1=3
- day_introduced (int: 1-120)
- is_concrete (boolean, default: true) -- NEW: for image decision
- pronunciation_difficulty (smallint: 1-3, default: 1) -- NEW: AI coach priority
- example_sentence_en (text, nullable)
- example_sentence_vi (text, nullable)
- image_url (string, nullable)
- audio_url (string)
- frequency_rank (int)
- created_at, updated_at
Key fields explained:
- difficulty_level: Source of truth for CEFR level (A1/A2/B1)
- is_concrete: Determines if the word needs an image (concrete nouns = true)
- pronunciation_difficulty: Heuristic score for Vietnamese learners (1=easy, 3=hard)
- frequency_rank: Oxford frequency ranking (higher = more common)
2. user_word_progress Table - SM-2 Tracking
user_word_progress:
- id (PK)
- user_id (FK → users)
- word_id (FK → words)
- status (enum: new, learning, mastered, forgotten) -- Keep for readability
- ease_factor (float, default: 2.5) -- SM-2 parameter
- repetition_count (int, default: 0)
- interval_days (int, default: 1)
- next_review_date (date)
- last_reviewed_at (timestamp, nullable)
- last_quality_score (smallint: 0-5, nullable) -- NEW: SM-2 quality rating
- correct_count (int, default: 0)
- incorrect_count (int, default: 0)
- created_at, updated_at
- UNIQUE(user_id, word_id)
SM-2 fields:
- ease_factor: Determines interval growth rate (1.3-2.5+)
- repetition_count: Number of successful reviews
- interval_days: Days until next review (capped at 180)
- last_quality_score: User's last self-rating (0-5)
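The ease-factor update is the standard SM-2 formula; a small standalone sketch makes the effect of each quality score concrete (the 1.3 floor and 2.5 default come from the schema above):

```python
def updated_ease_factor(ef: float, quality: int) -> float:
    """Standard SM-2 ease-factor update, floored at 1.3."""
    return max(1.3, ef + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)))

for q in range(6):
    print(q, round(updated_ease_factor(2.5, q), 2))
# Quality 5 raises EF to 2.6, 4 leaves it at 2.5,
# 3 and below shrink it (never past the 1.3 floor)
```

In practice this means a word rated 4 keeps its current growth rate, while repeated ratings of 3 slowly tighten the review schedule.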
3. daily_lessons Table - Pre-structured Lessons
daily_lessons:
- id (PK)
- day_number (int: 1-120, unique)
- level (enum: A1, A2, B1) -- Removed phase column
- new_word_ids (jsonb) -- Array of 25 word IDs (for new words only)
- estimated_duration_minutes (int, default: 25)
- created_at, updated_at
Design note:
- new_word_ids is a plan, not used for spaced-repetition logic
- Review words are queried from user_word_progress.next_review_date
4. user_daily_progress Table - Completion Tracking
user_daily_progress:
- id (PK)
- user_id (FK → users)
- daily_lesson_id (FK → daily_lessons)
- status (enum: not_started, in_progress, completed)
- words_learned_count (int)
- words_reviewed_count (int)
- accuracy_rate (float) -- % correct answers
- time_spent_minutes (int)
- completed_at (timestamp, nullable)
- created_at, updated_at
- UNIQUE(user_id, daily_lesson_id)
5. checkpoint_tests Table - Milestone Tests
checkpoint_tests:
- id (PK)
- day_number (int: 40, 80, 120)
- level (enum: A1, A2, B1) -- Removed phase column
- test_type (enum: recognition, recall, mixed, default: mixed) -- NEW: optional
- total_words_tested (int: 1,000 for days 40/80, 500 sampled for day 120)
- pass_threshold (float: 0.70)
- created_at, updated_at
Test types:
- recognition: Multiple choice (show English → choose Vietnamese)
- recall: Type the answer (show Vietnamese → type English)
- mixed: 50/50 combination
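A minimal sketch of how a recognition question could be assembled from the words table (make_recognition_question and the dict field names are hypothetical, not the service's actual API):

```python
import random

def make_recognition_question(word: dict, distractor_pool: list, rng: random.Random) -> dict:
    """Build one multiple-choice (recognition) question:
    show the English word, offer 4 Vietnamese meanings."""
    # Pick 3 wrong meanings from other words in the pool
    distractors = rng.sample(
        [w["vietnamese_meaning"] for w in distractor_pool if w["id"] != word["id"]], 3
    )
    options = distractors + [word["vietnamese_meaning"]]
    rng.shuffle(options)
    return {
        "prompt": word["word"],
        "options": options,
        "correct": word["vietnamese_meaning"],
    }

pool = [
    {"id": 1, "word": "apple", "vietnamese_meaning": "quả táo"},
    {"id": 2, "word": "dog", "vietnamese_meaning": "con chó"},
    {"id": 3, "word": "house", "vietnamese_meaning": "ngôi nhà"},
    {"id": 4, "word": "water", "vietnamese_meaning": "nước"},
    {"id": 5, "word": "book", "vietnamese_meaning": "quyển sách"},
]
q = make_recognition_question(pool[0], pool, random.Random(42))
```

Recall questions would skip the distractor step entirely and compare typed input against the English word instead.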
6. user_checkpoint_results Table
user_checkpoint_results:
- id (PK)
- user_id (FK → users)
- checkpoint_test_id (FK → checkpoint_tests)
- total_questions (int)
- correct_answers (int)
- score_percentage (float)
- passed (boolean)
- completed_at (timestamp)
- created_at, updated_at
Service Layer Logic
1. VocabularyService
Responsibility: Load and organize the 3,000 Oxford words by phase
```python
class VocabularyService:
    async def get_words_for_day(self, day_number: int) -> List[Word]:
        """Get 25 new words for a specific day."""
        # Guard: day number validation
        if day_number < 1 or day_number > 120:
            raise InvalidDayNumberError(f"Day must be 1-120, got {day_number}")

        words = await word_repo.get_by_day(day_number)

        # Guard: data integrity check
        if len(words) != 25:
            logger.error(f"Lesson plan corrupted: day {day_number} has {len(words)} words")
            raise LessonPlanCorruptedError(
                f"Expected 25 words for day {day_number}, found {len(words)}"
            )
        return words

    async def get_word_with_assets(self, word_id: int) -> WordDetail:
        """Get word + audio + image URLs."""
        # Returns word with pre-signed URLs for Azure TTS audio and Unsplash image

    async def search_words(
        self,
        level: Optional[str] = None,
        is_concrete: Optional[bool] = None,
        pronunciation_difficulty: Optional[int] = None
    ) -> List[Word]:
        """Filter words by criteria."""
```
2. SpacedRepetitionService - SM-2 Algorithm
Responsibility: Compute review intervals and scheduling
```python
from datetime import date, datetime, timedelta

class SpacedRepetitionService:
    MAX_INTERVAL_DAYS = 180  # Cap at ~6 months

    async def calculate_next_review(
        self,
        user_word: UserWordProgress,
        quality_score: int  # 0-5 from user feedback
    ) -> UserWordProgress:
        """
        SM-2 Algorithm Implementation

        Quality scores:
        - 0-2: Fail (not remembered / wrong)
        - 3: Hard (vague recollection)
        - 4-5: Good/Easy (clearly remembered)
        """
        # Guard: validate quality score
        if not 0 <= quality_score <= 5:
            raise ValueError(f"Quality score must be 0-5, got {quality_score}")

        if quality_score < 3:
            # Failed - reset
            user_word.interval_days = 1
            user_word.repetition_count = 0
            user_word.status = 'learning'
        else:
            # Passed
            if user_word.repetition_count == 0:
                user_word.interval_days = 1
            elif user_word.repetition_count == 1:
                user_word.interval_days = 6
            else:
                user_word.interval_days = round(
                    user_word.interval_days * user_word.ease_factor
                )
            # Guard: cap interval to prevent runaway
            user_word.interval_days = min(user_word.interval_days, self.MAX_INTERVAL_DAYS)
            user_word.repetition_count += 1

        # Update ease factor
        user_word.ease_factor = max(
            1.3,
            user_word.ease_factor + (0.1 - (5 - quality_score) * (0.08 + (5 - quality_score) * 0.02))
        )

        # Update status
        if user_word.repetition_count >= 3 and user_word.interval_days >= 14:
            user_word.status = 'mastered'

        # Set next review date
        user_word.next_review_date = date.today() + timedelta(days=user_word.interval_days)
        user_word.last_reviewed_at = datetime.now()
        user_word.last_quality_score = quality_score
        return user_word

    async def get_due_reviews(
        self,
        user_id: int,
        limit: int = 10
    ) -> List[UserWordProgress]:
        """
        Get words due for review today, prioritizing overdue words.
        Guard: order overdue-first to handle backlog gracefully.
        """
        due_words = await user_word_repo.get_due_for_review(
            user_id=user_id,
            as_of_date=date.today(),
            order_by='next_review_date ASC',  # Overdue first
            limit=limit
        )
        return due_words
```
SM-2 intervals visualization:

```
Repetition 0 → 1 day
Repetition 1 → 6 days
Repetition 2 → ~15 days (6 × 2.5 ease factor)
Repetition 3 → ~38 days
Repetition 4+ → exponential growth (capped at 180 days)
```
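The schedule above can be sanity-checked with a standalone sketch of the interval rule (a hypothetical helper mirroring the service logic, holding the ease factor at the 2.5 default):

```python
MAX_INTERVAL_DAYS = 180

def next_interval(interval_days: int, repetition_count: int, ease_factor: float) -> int:
    """SM-2 interval for a successful review (quality >= 3)."""
    if repetition_count == 0:
        new_interval = 1
    elif repetition_count == 1:
        new_interval = 6
    else:
        new_interval = round(interval_days * ease_factor)
    return min(new_interval, MAX_INTERVAL_DAYS)

# Simulate a streak of successful reviews at the default ease factor
interval = 0
schedule = []
for rep in range(8):
    interval = next_interval(interval, rep, 2.5)
    schedule.append(interval)

print(schedule)  # [1, 6, 15, 38, 95, 180, 180, 180]
```

Note the 180-day cap kicks in on the sixth successful review; without it the interval would jump to 238 days and keep growing.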
3. DailyLessonService
Responsibility: Build the daily lesson (new words + 10 reviews)
```python
class DailyLessonService:
    def _get_new_word_count(self, prescription_type: str) -> int:
        """
        Return daily new-word count based on prescription.
        Reference: definitions.md Section 8
        """
        WORD_COUNTS = {
            'don_mu': 25,
            'don_diec': 10,
            'don_cam': 10,
            'don_yeu_phan_xa': 10,
            'don_foundation': 15,
        }
        return WORD_COUNTS.get(prescription_type, 25)  # Default to 25

    async def get_lesson_for_user(
        self,
        user_id: int,
        day_number: int
    ) -> DailyLesson:
        """
        Generate daily lesson with deduplication.

        Word count varies by prescription:
        - RX_MU: 25 new words + 10 review
        - RX_DIEC/CAM/PHAN_XA: 10 new words + 10 review
        - RX_FOUNDATION: 15 new words + 10 review
        """
        # 0. Get user's prescription to determine word count
        user_prescription = await diagnosis_repo.get_latest_prescription(user_id)
        new_word_count = self._get_new_word_count(user_prescription.prescription_type)

        # 1. Get pre-defined lesson plan
        lesson_plan = await daily_lesson_repo.get_by_day(day_number)

        # 2. Get new words (limited by prescription type)
        new_words = await vocab_service.get_words_by_ids(
            lesson_plan.new_word_ids[:new_word_count]
        )
        new_word_ids_set = set(lesson_plan.new_word_ids[:new_word_count])

        # 3. Get up to 10 review words (due today or overdue)
        review_candidates = await spaced_repetition_service.get_due_reviews(
            user_id=user_id,
            limit=15  # Fetch extra to account for dedup
        )
        # Guard: deduplicate - exclude words already in new_words
        review_words = [
            word for word in review_candidates
            if word.word_id not in new_word_ids_set
        ][:10]  # Take first 10 after dedup

        # 4. Combine and return
        return DailyLesson(
            day_number=day_number,
            new_words=new_words,
            review_words=review_words,
            total_words=len(new_words) + len(review_words),
            estimated_duration=25
        )

    async def record_word_interaction(
        self,
        user_id: int,
        word_id: int,
        quality_score: int,
        is_new_word: bool
    ):
        """Record user interaction with idempotency check."""
        user_word = await user_word_repo.get_or_create(user_id, word_id)

        # Guard: prevent double updates on the same day
        if user_word.last_reviewed_at:
            last_review_date = user_word.last_reviewed_at.date()
            if last_review_date == date.today():
                logger.warning(
                    f"Word {word_id} already reviewed today by user {user_id}, skipping"
                )
                # Return existing state (idempotent response)
                return {
                    "success": True,
                    "skipped": True,
                    "message": "Từ này đã được ghi nhận hôm nay",
                    "next_review_date": user_word.next_review_date,
                    "status": user_word.status
                }

        if is_new_word:
            user_word.status = 'learning'
            user_word.repetition_count = 0

        # Update counts
        if quality_score >= 3:
            user_word.correct_count += 1
        else:
            user_word.incorrect_count += 1

        # Calculate next review using SM-2
        user_word = await spaced_repetition_service.calculate_next_review(
            user_word, quality_score
        )
        await user_word_repo.update(user_word)

        return {
            "success": True,
            "skipped": False,
            "next_review_date": user_word.next_review_date,
            "interval_days": user_word.interval_days,
            "status": user_word.status
        }
```
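The double-submit guard hinges on comparing calendar dates; a minimal, timezone-naive sketch of that check (already_reviewed_today is a hypothetical helper, not the service's actual API):

```python
from datetime import date, datetime

def already_reviewed_today(last_reviewed_at, today=None) -> bool:
    """True if the word was already reviewed on the current calendar day.
    Note: compares server-local dates; a production version should pin a timezone,
    otherwise a review at 23:59 unlocks again one minute later."""
    if last_reviewed_at is None:
        return False
    today = today or date.today()
    return last_reviewed_at.date() == today

assert already_reviewed_today(None) is False
assert already_reviewed_today(datetime(2026, 1, 9, 23, 59), today=date(2026, 1, 9)) is True
assert already_reviewed_today(datetime(2026, 1, 9, 23, 59), today=date(2026, 1, 10)) is False
```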
4. CheckpointTestService
Responsibility: Generate and grade tests on days 40, 80, and 120
```python
class CheckpointTestService:
    async def generate_test(
        self,
        checkpoint_day: int,
        user_id: int
    ) -> CheckpointTest:
        """
        Generate checkpoint test with smart sampling.

        Day 40: Test 1,000 words from A1 (days 1-40)
        Day 80: Test 1,000 words from A2 (days 41-80)
        Day 120: Sample 500 words from ALL 3,000 (weighted)
        """
        checkpoint_config = await checkpoint_repo.get_by_day(checkpoint_day)

        if checkpoint_day == 120:
            # Guard: don't test all 3,000 - sample strategically
            word_pool = await self._get_weighted_sample(
                user_id=user_id,
                target_count=500,
                weights={
                    'failed_before': 0.60,  # 300 words
                    'learning': 0.30,       # 150 words
                    'mastered': 0.10        # 50 words
                }
            )
        else:
            # Phase-specific test (test all 1,000)
            level = 'A1' if checkpoint_day == 40 else 'A2'
            word_pool = await vocab_service.get_words_by_level([level])

        test_questions = self._generate_questions(
            words=word_pool,
            test_type=checkpoint_config.test_type,
            count=len(word_pool)
        )
        return CheckpointTest(
            day_number=checkpoint_day,
            questions=test_questions,
            pass_threshold=checkpoint_config.pass_threshold
        )

    async def _get_weighted_sample(
        self,
        user_id: int,
        target_count: int,
        weights: Dict[str, float]
    ) -> List[Word]:
        """Sample words based on the user's learning history."""
        failed_words = await user_word_repo.get_by_criteria(
            user_id=user_id,
            incorrect_count__gt=2
        )
        learning_words = await user_word_repo.get_by_status(
            user_id=user_id,
            status='learning'
        )
        mastered_words = await user_word_repo.get_by_status(
            user_id=user_id,
            status='mastered'
        )
        # Sample according to weights
        sample = []
        sample.extend(random.sample(
            failed_words,
            min(len(failed_words), int(target_count * weights['failed_before']))
        ))
        sample.extend(random.sample(
            learning_words,
            min(len(learning_words), int(target_count * weights['learning']))
        ))
        sample.extend(random.sample(
            mastered_words,
            min(len(mastered_words), int(target_count * weights['mastered']))
        ))
        return sample

    async def grade_test(
        self,
        user_id: int,
        checkpoint_test_id: int,
        answers: List[Answer]
    ) -> CheckpointResult:
        """Grade test and determine pass/fail."""
        total = len(answers)
        # Guard: avoid division by zero on an empty submission
        if total == 0:
            raise ValueError("Cannot grade a test with no answers")
        correct = sum(1 for ans in answers if ans.is_correct)
        score = correct / total
        passed = score >= 0.70  # 70% threshold

        result = UserCheckpointResult(
            user_id=user_id,
            checkpoint_test_id=checkpoint_test_id,
            total_questions=total,
            correct_answers=correct,
            score_percentage=score * 100,
            passed=passed
        )
        await checkpoint_result_repo.create(result)
        return result
```
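One caveat in _get_weighted_sample above: because each bucket takes min(len(bucket), …) items, the sample comes up short of 500 whenever a bucket is small (e.g. a strong learner with few failed words). A hedged sketch of one way to top the sample up from the leftover pool (weighted_sample and its inputs are hypothetical, not the service's API):

```python
import random

def weighted_sample(buckets: dict, weights: dict, target_count: int,
                    rng: random.Random) -> list:
    """Sample target_count items across buckets by weight,
    topping up from leftovers when a bucket is too small."""
    sample, leftovers = [], []
    for name, items in buckets.items():
        take = min(len(items), int(target_count * weights[name]))
        chosen = rng.sample(items, take)
        sample.extend(chosen)
        leftovers.extend(x for x in items if x not in chosen)
    shortfall = target_count - len(sample)
    if shortfall > 0:
        # Fill the gap from whatever remains, regardless of bucket
        sample.extend(rng.sample(leftovers, min(shortfall, len(leftovers))))
    return sample

buckets = {
    "failed_before": list(range(0, 20)),    # only 20 failed words available
    "learning": list(range(100, 400)),
    "mastered": list(range(1000, 1600)),
}
weights = {"failed_before": 0.60, "learning": 0.30, "mastered": 0.10}
result = weighted_sample(buckets, weights, 500, random.Random(7))
print(len(result))  # 500 despite the small failed bucket
```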
5. ProgressTrackingService
Responsibility: Track overall progress
```python
class ProgressTrackingService:
    async def get_user_stats(self, user_id: int) -> UserStats:
        """Get comprehensive statistics."""
        total_learned = await user_word_repo.count_by_status(
            user_id, ['learning', 'mastered']
        )
        mastered = await user_word_repo.count_by_status(
            user_id, ['mastered']
        )
        current_streak = await self._calculate_streak(user_id)

        return UserStats(
            words_learned=total_learned,
            words_mastered=mastered,
            completion_percentage=(total_learned / 3000) * 100,                # Marketing metric
            true_mastery_percentage=(mastered / max(total_learned, 1)) * 100,  # Real metric
            current_day=current_streak.current_day,
            streak_days=current_streak.consecutive_days
        )
```
Edge Cases Handled
| Service | Edge Case | Solution |
|---|---|---|
| VocabularyService | Day > 120 or < 1 | Fail fast with InvalidDayNumberError |
| VocabularyService | Missing 25 words | Throw LessonPlanCorruptedError |
| SpacedRepetitionService | interval_days runaway | Cap at 180 days |
| DailyLessonService | User skips 5-7 days → review backlog | Limit to 10 reviews, overdue first |
| DailyLessonService | Word in both new + review | Deduplicate (exclude from review) |
| DailyLessonService | Double click / refresh | Idempotent (check last_reviewed_at.date) |
| CheckpointTestService | Day 120 → 3,000 words | Sample 500 weighted by learning status |
Data Flow & Frontend Integration
API Endpoints
Base path: /api/v1/vocabulary
Daily Lesson Endpoints
Get daily lesson:
GET /api/v1/vocabulary/daily-lesson/{day_number}

Response (200 OK):

```jsonc
{
  "day_number": 15,
  "level": "A1",  // Source: daily_lessons.level column (pre-defined in DB)
                  // Frontend: NEVER derive this from day_number yourself
  "new_words": [
    {
      "id": 123,
      "word": "apple",
      "vietnamese_meaning": "quả táo",
      "part_of_speech": "noun",
      "is_concrete": true,
      "pronunciation_difficulty": 1,
      "audio_url": "https://azure.blob/apple_noun.mp3",
      "image_url": "https://unsplash.com/apple-xyz",
      "example_sentence_en": "I eat an apple every day",
      "example_sentence_vi": "Tôi ăn một quả táo mỗi ngày"
    }
    // ... 24 more
  ],
  "review_words": [
    {
      "id": 45,
      "word": "hello",
      // ... same structure as new_words
      "user_progress": {
        "repetition_count": 2,
        "interval_days": 6,
        "last_quality_score": 4,
        "status": "learning"
      }
    }
    // ... up to 9 more
  ],
  "total_words": 35,
  "estimated_duration_minutes": 25
}
```
Submit word interaction:
POST /api/v1/vocabulary/word-interaction

Request:

```jsonc
{
  "word_id": 123,
  "quality_score": 5,  // 0-5
  "is_new_word": true,
  "time_spent_seconds": 15
}
```

Response Case 1 - first interaction (200 OK):

```json
{
  "success": true,
  "next_review_date": "2026-01-20",
  "interval_days": 1,
  "status": "learning",
  "message": "Tuyệt vời! Từ này sẽ xuất hiện lại vào ngày 20/01"
}
```

Response Case 2 - already reviewed today (200 OK):

```jsonc
{
  "success": true,
  "skipped": true,  // Idempotent flag
  "message": "Từ này đã được ghi nhận hôm nay",
  "next_review_date": "2026-01-20",
  "status": "learning"
}
```
Complete daily lesson:
POST /api/v1/vocabulary/daily-lesson/{day_number}/complete

Request:

```json
{
  "words_learned": 25,
  "words_reviewed": 10,
  "accuracy_rate": 0.85,
  "time_spent_minutes": 28
}
```

Response (200 OK):

```json
{
  "completed": true,
  "streak_days": 15,
  "next_day_unlocked": 16,
  "encouragement_message": "Tuyệt vời! Bạn đã hoàn thành ngày 15. Hẹn gặp bạn ngày mai!"
}
```
Progress Endpoints
GET /api/v1/vocabulary/stats

Response (200 OK):

```json
{
  "words_learned": 375,
  "words_mastered": 120,
  "current_day": 15,
  "streak_days": 15,
  "completion_percentage": 12.5,
  "true_mastery_percentage": 32.0,
  "next_checkpoint": {
    "day": 40,
    "words_remaining": 625
  }
}
```
Checkpoint Test Endpoints
Start test:
POST /api/v1/vocabulary/checkpoint/{day_number}/start

Response (200 OK):

```jsonc
{
  "test_id": "ckpt_40_user123_xyz",
  "day_number": 40,
  "total_questions": 1000,
  "page_size": 50,  // Paginated to prevent frontend lag
  "total_pages": 20,
  "test_type": "mixed",
  "pass_threshold": 0.70,
  "questions": [
    // First 50 questions
  ],
  "pagination": {
    "current_page": 1,
    "has_next": true,
    "next_page_url": "/api/v1/vocabulary/checkpoint/40/questions?page=2&test_id=..."
  }
}
```
Submit test:
POST /api/v1/vocabulary/checkpoint/{day_number}/submit

Request:

```jsonc
{
  "test_id": "ckpt_40_user123_xyz",
  "answers": [
    { "question_id": "q1", "answer": "quả táo" }
    // ... all answers
  ]
}
```

Response (200 OK):

```jsonc
{
  "total_questions": 1000,
  "correct_answers": 750,
  "score_percentage": 75.0,
  "passed": true,
  "pass_threshold": 70.0,
  "message": "Chúc mừng! Bạn đã vượt qua checkpoint A1 với 75%",
  "next_phase_unlocked": "A2",
  "weak_words": [
    // Words answered incorrectly
  ]
}
```
Frontend State Management
LessonStore (with interaction locks):

```typescript
interface LessonState {
  currentDay: number;
  newWords: Word[];
  reviewWords: Word[];
  currentWordIndex: number;

  // Interaction locks
  submittedWords: Set<number>;
  isSubmitting: boolean;
  lastSubmitError: Error | null;

  // Actions
  loadLesson: (dayNumber: number) => Promise<void>;
  submitWordInteraction: (wordId: number, qualityScore: number) => Promise<void>;
  canSubmitWord: (wordId: number) => boolean;
  nextWord: () => void;
  completeLesson: () => Promise<void>;
}
```
ProgressStore (single source of truth):

```typescript
interface ProgressState {
  stats: UserStats;
  wordsLearned: Map<number, UserWordProgress>;
  currentStreak: number;

  // Actions
  fetchStats: () => Promise<void>;
  updateWordProgress: (wordId: number, backendResponse: any) => void;
  // DON'T: self-increment counters (causes drift from the backend)
}
```
Frontend Developer Rules
```typescript
/**
 * RULE 1: Never derive level from day_number
 * ✅ GOOD: const level = lessonData.level; // Trust backend
 * ❌ BAD:  const level = dayNumber <= 40 ? 'A1' : ...;
 */

/**
 * RULE 2: Handle duplicate interactions gracefully
 * Check the response.skipped flag, not error codes
 */
if (response.skipped) {
  disableRatingUI(wordId);
  toast.info(response.message); // Gentle info toast, not an error
}

/**
 * RULE 3: Pagination for checkpoints
 * Load the first 5 pages (250 questions) initially
 */

/**
 * RULE 4: Interaction locks
 * Always check canSubmitWord() before allowing a user action
 */

/**
 * RULE 5: Single source of truth
 * Always update state from backend responses only
 */
```
Content Preparation & Data Migration
Oxford 3000 Word List Sourcing
Primary source: Oxford Learner's Dictionaries API
```python
import requests

# ASSUMPTION CLARIFICATION:
# The Oxford API endpoint /wordlist/en-gb/oxford3000
# returns the COMPLETE list in ONE request (not 3,000 individual requests)

async def fetch_oxford_3000():
    """
    Fetch the Oxford 3000 word list.
    Free tier: 1,000 requests/month
    Strategy: fetch the word list once; enrich only the top 500 words
    """
    endpoint = "https://od-api.oxforddictionaries.com/api/v2/wordlist/en-gb/oxford3000"
    headers = {
        'app_id': OXFORD_APP_ID,
        'app_key': OXFORD_API_KEY
    }
    # Note: requests.get blocks the event loop; acceptable for a one-off seeding script
    response = requests.get(endpoint, headers=headers)
    words = response.json()['results']
    logger.info(f"Fetched {len(words)} words in single request")
    return words
```
Expected distribution:
Oxford 3000 breakdown:
- A1: ~500-700 words
- A2: ~800-1000 words
- B1: ~1200-1500 words
- B2+: ~800 words (exclude)
Target: Balance to exactly 1,000 per level (A1, A2, B1)
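Given that distribution, balancing to exactly 1,000 per level mostly means borrowing the most frequent words from the next level up (A2 → A1, B1 → A2) and trimming the B1 surplus. A hedged sketch of that rebalancing (balance_to_per_level and the dict shapes are hypothetical, scaled down to 10 per level for illustration):

```python
def balance_to_per_level(categorized: dict, per_level: int = 1000) -> dict:
    """Fill A1 up to per_level with the most frequent A2 words,
    then A2 with the most frequent B1 words; trim B1 to per_level.
    Assumes no level starts above per_level except B1."""
    def by_freq(ws):
        return sorted(ws, key=lambda w: -w["frequency"])

    a1, a2, b1 = (by_freq(categorized[l]) for l in ("A1", "A2", "B1"))

    need = max(0, per_level - len(a1))
    a1, a2 = a1 + a2[:need], a2[need:]   # promote the easiest A2 words into A1

    need = max(0, per_level - len(a2))
    a2, b1 = a2 + b1[:need], b1[need:]   # promote the easiest B1 words into A2

    return {"A1": a1, "A2": a2, "B1": b1[:per_level]}

# Tiny synthetic example mirroring the skew above (A1 short, B1 overfull)
cat = {
    "A1": [{"word": f"a{i}", "frequency": 100 - i} for i in range(6)],
    "A2": [{"word": f"b{i}", "frequency": 80 - i} for i in range(9)],
    "B1": [{"word": f"c{i}", "frequency": 60 - i} for i in range(18)],
}
out = balance_to_per_level(cat, per_level=10)
print([len(out[l]) for l in ("A1", "A2", "B1")])  # [10, 10, 10]
```

Promoting by frequency keeps the relabeled words as close as possible to their new level's difficulty; a manual review pass should still spot-check the boundary words.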
Daily Distribution Strategy
Assigning day_introduced (1-120):

```python
async def assign_day_introduced(categorized: Dict) -> List[Dict]:
    """
    Assign day_introduced based on:
    1. Frequency (most common first)
    2. Concreteness (concrete nouns before abstract)
    3. Part of speech (nouns → verbs → adjectives)
    """
    all_words = []
    day_counter = 1

    # Phase 1: A1 words (days 1-40)
    a1_words = categorized['A1']
    a1_words.sort(key=lambda w: (
        -w.get('frequency', 0),         # Higher frequency first
        w['part_of_speech'] != 'noun',  # Nouns first
        not is_concrete(w)              # Concrete first
    ))
    for i, word in enumerate(a1_words):
        word['day_introduced'] = day_counter + (i // 25)
        word['difficulty_level'] = 'A1'
        all_words.append(word)

    # ... repeat for A2 (days 41-80) and B1 (days 81-120)
    return all_words
```
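The `i // 25` chunking yields exactly 25 words per day; a quick self-contained check over one 1,000-word phase:

```python
from collections import Counter

WORDS_PER_DAY = 25
day_counter = 1  # Phase 1 starts at day 1
days = [day_counter + (i // WORDS_PER_DAY) for i in range(1000)]

counts = Counter(days)
print(min(days), max(days), set(counts.values()))  # 1 40 {25}
```

Starting subsequent phases with day_counter = 41 and 81 gives the same 25-per-day guarantee across all 120 days, which is exactly what the validation script later asserts.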
Audio Generation (Azure TTS)
Strategy: Pre-generate all 3,000 audio files

```python
async def generate_audio_for_word(word: str, output_path: str):
    """
    Generate audio using Azure TTS.
    Voice: en-US-JennyNeural (female, clear)
    Format: MP3, 48kHz
    """
    speech_config = SpeechConfig(
        subscription=AZURE_SPEECH_KEY,
        region=AZURE_REGION
    )
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # SSML for better pronunciation (slightly slowed down)
    ssml = f"""
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <prosody rate="0.9">
          {word}
        </prosody>
      </voice>
    </speak>
    """
    # ... synthesize and upload to Azure Blob Storage
```
File naming (handles homonyms):

```python
def generate_audio_filename(word: Dict) -> str:
    """
    Format: {word}_{pos}.mp3
    Examples:
    - run_verb.mp3
    - run_noun.mp3
    - book_noun.mp3
    Prevents collisions for words with multiple parts of speech.
    """
    word_text = word['word'].lower().replace(' ', '_')
    pos = word['part_of_speech'][:4]
    return f"{word_text}_{pos}.mp3"
```
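Exercised standalone (restating the function above with plain dicts), the scheme keeps homonyms apart. One quirk worth knowing: slicing the part of speech to four characters turns "adjective" into "adje", which is unambiguous but slightly opaque when listing blobs:

```python
def generate_audio_filename(word: dict) -> str:
    word_text = word["word"].lower().replace(" ", "_")
    pos = word["part_of_speech"][:4]
    return f"{word_text}_{pos}.mp3"

names = [
    generate_audio_filename({"word": "run", "part_of_speech": "verb"}),
    generate_audio_filename({"word": "run", "part_of_speech": "noun"}),
    generate_audio_filename({"word": "ice cream", "part_of_speech": "noun"}),
    generate_audio_filename({"word": "brave", "part_of_speech": "adjective"}),
]
print(names)  # ['run_verb.mp3', 'run_noun.mp3', 'ice_cream_noun.mp3', 'brave_adje.mp3']
```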
Cost:

```
Azure TTS Neural Voice: $16 per 1M characters
3,000 words × avg 8 characters = 24,000 characters
Cost: $16 × (24,000 / 1,000,000) ≈ $0.38 (one-time)
✅ Extremely affordable for pre-generation
```
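The figure follows directly from per-character pricing (a sketch; the $16/1M-character rate is this document's assumption about Azure's neural-voice pricing):

```python
def tts_cost_usd(word_count: int, avg_chars_per_word: int,
                 price_per_million_chars: float = 16.0) -> float:
    """One-time synthesis cost estimate."""
    total_chars = word_count * avg_chars_per_word
    return total_chars / 1_000_000 * price_per_million_chars

print(round(tts_cost_usd(3000, 8), 2))  # 0.38
```

Even doubling the character count (e.g. to add example-sentence audio later) keeps the one-time cost under a dollar.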
Image Sourcing (Unsplash API)
Strategy: Fetch only for concrete words (~1,500 images)
```python
import requests

async def fetch_image_for_word(word: str) -> str | None:
    """
    Fetch an image from the Unsplash API.
    Only for concrete words (is_concrete = true).
    """
    endpoint = "https://api.unsplash.com/search/photos"
    headers = {'Authorization': f'Client-ID {UNSPLASH_ACCESS_KEY}'}
    params = {
        'query': word,
        'per_page': 1,
        'orientation': 'squarish',
        'content_filter': 'high'
    }
    response = requests.get(endpoint, params=params, headers=headers)
    if response.status_code == 200:
        results = response.json()['results']
        if results:
            return results[0]['urls']['regular']
    return None
```
Rate limits:

```
Demo (free) tier: 50 requests/hour
~1,500 concrete words ÷ 50 = 30 hours to fetch all
Production tier: 5,000 requests/hour → all images in < 1 hour
✅ Use the demo tier for MVP; upgrade if needed
```
Pronunciation Difficulty Tagging
Heuristic-based (not phonetic science):

```python
def calculate_pronunciation_difficulty(word: Dict) -> int:
    """
    HEURISTIC score (1-3) for Vietnamese learners.

    ⚠️ IMPORTANT: this is NOT phonetic analysis
    - Based on common Vietnamese learner patterns
    - Use as guidance, not absolute truth

    Difficulty factors:
    - /θ/ /ð/ ("th"): not in Vietnamese → 3
    - /v/ /w/: often confused → 2
    - Consonant clusters: difficult → 3
    """
    word_text = word['word'].lower()
    difficulty = 1
    if 'th' in word_text:
        difficulty = max(difficulty, 3)
    if word_text.startswith('v') or word_text.startswith('w'):
        difficulty = max(difficulty, 2)
    # ... more heuristics
    return difficulty
```
Database Migration Script
```python
async def seed_words_table():
    """
    One-time migration to populate the words table.

    Workflow:
    1. Fetch Oxford 3000 from API (1 request)
    2. Filter to A1-B1 (3,000 words)
    3. Assign day_introduced (1-120)
    4. Generate audio files (Azure TTS)
    5. Fetch images (Unsplash)
    6. Calculate pronunciation_difficulty
    7. Insert into database
    """
    logger.info("Step 1: Fetching Oxford 3000...")
    raw_words = await fetch_oxford_3000()

    logger.info("Step 2: Filtering to A1-B1...")
    categorized = await filter_by_cefr(raw_words)
    balanced = await balance_to_1000_per_level(categorized)

    logger.info("Step 3: Assigning day_introduced...")
    words_with_days = await assign_day_introduced(balanced)

    logger.info("Step 4: Generating audio files...")
    words_with_audio = await batch_generate_audio(words_with_days)

    logger.info("Step 5: Fetching images...")
    words_with_images = await batch_fetch_images(words_with_audio)

    logger.info("Step 6: Calculating pronunciation difficulty...")
    for word in words_with_images:
        word['pronunciation_difficulty'] = calculate_pronunciation_difficulty(word)
        word['is_concrete'] = is_concrete(word)

    logger.info("Step 7: Inserting into database...")
    await word_repository.bulk_insert(words_with_images)
    logger.info("✅ Migration complete! 3,000 words seeded.")
```
Data Validation
```python
async def validate_seeded_data():
    """Validate data integrity after migration."""
    # Check 1: total count
    total = await word_repository.count()
    assert total == 3000, f"Expected 3000 words, found {total}"

    # Check 2: distribution per level
    a1_count = await word_repository.count_by_level('A1')
    a2_count = await word_repository.count_by_level('A2')
    b1_count = await word_repository.count_by_level('B1')
    assert a1_count == 1000
    assert a2_count == 1000
    assert b1_count == 1000

    # Check 3: each day has exactly 25 words
    for day in range(1, 121):
        day_words = await word_repository.get_by_day(day)
        assert len(day_words) == 25

    # Check 4: audio URLs present
    missing_audio = await word_repository.count_where(audio_url=None)
    if missing_audio > 0:
        logger.warning(f"{missing_audio} words missing audio URLs")

    # Check 5: no duplicates
    duplicates = await word_repository.find_duplicates()
    assert len(duplicates) == 0

    logger.info("✅ All validation checks passed")
```
Migration Timeline & Cost
Timeline:
- Week 1: Data sourcing
  - Days 1-2: Obtain Oxford 3000 list
  - Days 3-4: Filter, categorize, assign day_introduced
  - Day 5: Manual review
- Week 2: Asset generation
  - Days 1-2: Generate 3,000 audio files
  - Days 3-5: Fetch ~1,500 images
  - Days 6-7: Tag pronunciation_difficulty
- Week 3: Migration & QA
  - Day 1: Run migration script
  - Days 2-3: Validation & QA
  - Days 4-7: Fix issues, fill gaps

Total: ~3 weeks
Cost:
One-time costs:
- Oxford API: $0-50 (depends on tier)
- Azure TTS: ~$0.40
- Unsplash API: Free
- Total: < $60
Implementation Checklist
Backend
- Create database tables (schema Section 2)
- Implement VocabularyService with guards
- Implement SpacedRepetitionService (SM-2)
- Implement DailyLessonService with deduplication
- Implement CheckpointTestService with sampling
- Implement ProgressTrackingService
- Create API endpoints (Section 4)
- Run migration script to seed 3,000 words
- Validate seeded data
- Write unit tests for SM-2 algorithm
- Write integration tests for API endpoints
Frontend
- Create LessonStore with interaction locks
- Create ProgressStore (single source of truth)
- Create CheckpointStore
- Build WordCard component (flashcard)
- Build AudioPlayer component
- Build QualityRating component (0-5 stars)
- Build DailyLessonPage
- Build CheckpointTestPage (with pagination)
- Build ProgressDashboard
- Implement offline support (cache audio)
- Add performance optimizations (lazy loading)
- Write E2E tests for lesson flow
Content Preparation
- Obtain Oxford API access
- Fetch Oxford 3000 word list
- Filter to A1-B1 (3,000 words)
- Balance to 1,000 per level
- Assign day_introduced (1-120)
- Setup Azure TTS
- Generate 3,000 audio files
- Upload to Azure Blob Storage
- Setup Unsplash API
- Fetch images for concrete words
- Tag is_concrete
- Tag pronunciation_difficulty
- Run data validation
- Manual QA with Vietnamese teachers
Notes & Assumptions
Key Assumptions
- Oxford API: /wordlist/oxford3000 returns the full list in one request
- Frequency data: Use Oxford if available, fall back to SUBTLEX-US, default to 0
- Audio naming: the {word}_{pos}.mp3 format prevents homonym collisions
- Pronunciation difficulty: heuristic score (1-3), not phonetic science
Phase 2 Enhancements (Optional)
- Example audio URLs: Generate audio for example sentences
- Visual priority field: Tag words by image importance (1-3)
- Adaptive difficulty: Adjust based on user performance
- Gamification: Badges for milestones (100 words, 7-day streak)
References & Dependencies
Related Design Documents
- definitions.md: Single source of truth for constants, daily load, and checkpoint test rules
- diagnosis-test.md: Prescription assignment logic (RX_MU)
- database-schema.md: Tables words, user_word_progress, daily_lessons, checkpoint_tests
- gamification.md: Word count milestones (100, 500, 1500, 3000)
- ai-coach-behavior.md: Celebration messages for word milestones
Data Dependencies
- Oxford 3000 word list (external source)
- Azure TTS for audio generation
- Unsplash API for images (concrete words only)
Implementation Notes
- Checkpoint tests paginated (50 questions/page) - see Section 4 API
- SM-2 algorithm for spaced repetition - see Section 3 Service Layer
- Daily lesson deduplication - see DailyLessonService
End of Design Document
This design has been validated through interactive brainstorming session and is ready for implementation.