Dictionary Frequency Labels

HanyuGuide shows broad subtitle-frequency labels instead of raw ranks. The labels are meant to help learners prioritize words; they do not claim to measure exact real-world usage in every context.

Label Boundaries

SUBTLEX-CH rank      HanyuGuide label
1-1,000              Very common
1,001-5,000          Common
5,001-20,000         Uncommon
20,001+              Rare
No matching rank     Not enough subtitle data
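
As an illustration only, the boundaries in the table above can be expressed as a small lookup. This is a minimal sketch, not HanyuGuide's actual code: the function name frequency_label and the _BUCKETS constant are hypothetical, and treating a rank below 1 the same as a missing rank is an assumption.

```python
from typing import Optional

# Inclusive upper bound of each bucket, copied from the table above.
# The names here are hypothetical and chosen for illustration.
_BUCKETS = [
    (1_000, "Very common"),
    (5_000, "Common"),
    (20_000, "Uncommon"),
]


def frequency_label(rank: Optional[int]) -> str:
    """Map a SUBTLEX-CH rank to a coarse label.

    A missing rank (None) means no matching entry was found. Ranks below 1
    are also treated as missing, which is an assumption of this sketch.
    """
    if rank is None or rank < 1:
        return "Not enough subtitle data"
    for upper_bound, label in _BUCKETS:
        if rank <= upper_bound:
            return label
    # Anything past the last bound falls into the open-ended 20,001+ bucket.
    return "Rare"
```

For example, frequency_label(1000) gives "Very common" and frequency_label(1001) gives "Common", because each upper bound is inclusive.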

Source and Method

Frequency labels are derived from "SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles" by Qing Cai and Marc Brysbaert, PLOS ONE 5(6): e10729 (2010). The dataset is licensed under the Creative Commons Attribution 4.0 International license. HanyuGuide normalizes the published word-frequency ranks and maps them to learner-facing frequency labels.

The source corpus is based on film and television subtitles, so these labels are best read as subtitle-frequency benchmarks rather than formal HSK levels, textbook levels, or universal spoken-frequency rankings.

Sources: PLOS ONE article, supporting frequency files, and Figshare dataset mirror.

Why Raw Ranks Are Hidden

Raw ranks can look more precise than they are. Dictionary entries may have multiple readings, alternate spellings, or source-normalization differences, so HanyuGuide currently uses coarse buckets on public pages and keeps raw rank values out of public agent and mobile API responses.
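
One way to keep raw ranks out of a public response is to filter them from the entry before it is serialized. The sketch below shows the general idea under stated assumptions: the field name subtlex_rank, the entry shape, and the function public_entry are all hypothetical and do not describe HanyuGuide's real API.

```python
from typing import Any, Dict

# Hypothetical name of the internal field that holds the raw rank.
_PRIVATE_FIELDS = {"subtlex_rank"}


def public_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the entry without internal-only fields."""
    return {key: value for key, value in entry.items() if key not in _PRIVATE_FIELDS}


# Example with made-up values: the coarse label survives, the raw rank does not.
internal = {"word": "你好", "subtlex_rank": 345, "frequency_label": "Very common"}
public = public_entry(internal)
```

Filtering against a set of private field names keeps the rule in one place, so any other internal-only value can be hidden by adding its name to the set.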

See the open-source credits for the full data-source notice.