
Introduction

This article provides an overview of typesetting multilingual documents on Overleaf using the XeLaTeX (or LuaLaTeX) compiler in conjunction with the fontspec and polyglossia LaTeX packages.

Many, if not most, users default to pdfTeX as their TeX engine. Unlike XeTeX and LuaTeX, pdfTeX does not have a built-in capability to read UTF-8 encoded text files, which makes typesetting certain languages in LaTeX very complicated, especially those that do not use a Latin-based script. Some packages, such as inputenc, fontenc and arabtex, provide support to pdfTeX for typesetting non-Latin languages and scripts, but not all glyphs and characters may be supported or rendered correctly in the output PDF, even if you’ve used the utf8 or utf8x option with inputenc.

For an in-depth discussion of UTF-8, Unicode encoding and the XeTeX/LuaTeX engines, the Overleaf article Unicode, UTF-8 and multilingual text: An introduction is a fascinating read.

Enter XeTeX and LuaTeX

The XeTeX and LuaTeX engines can directly read/process UTF-8 encoded text; consequently, they offer native support for Unicode—they can also work with TrueType and OpenType fonts directly. These properties make them a natural choice for typesetting multilingual or non-Latin documents in LaTeX, producing outputs like these:

These examples can be found in the Overleaf Gallery: How to Write Multilingual Text with Different Scripts in LaTeX on Overleaf and Multilingual "Thank-You".

If you’re looking to typeset Chinese, Japanese and Korean, have a look at these articles:

Xe(La)TeX is still useful for these languages, but more specialised TeX engines are available, specifically designed for typesetting CJK languages—such as pTeX for typesetting Japanese.

Note that if your cursor seems to be misbehaving whilst editing text in certain languages on Overleaf, you may want to click on the Overleaf Menu button (situated above the project file list) and change the “Font Family” option. You could also try changing your browser’s monospaced font preferences or using Overleaf’s Rich Text view instead. However, at the time of writing, the Source and Rich Text views may not (yet) fully support right-to-left text editing at the level of functionality we are aiming to achieve.

Changing the project’s compiler

The fontspec and polyglossia packages require the XeLaTeX or LuaLaTeX compiler, so you’ll need to set up your Overleaf project to use one of those compilers: open the Overleaf Menu and change the Compiler setting. Detailed instructions can be found in our article Choosing a LaTeX Compiler.

Once you’re compiling with XeLaTeX or LuaLaTeX, you should remove the inputenc and fontenc packages from your .tex file’s preamble because these Unicode-capable engines assume that input (text) files are UTF-8 encoded. Incidentally, all text files uploaded to Overleaf are converted to UTF-8, so you should usually use the utf8 option with inputenc when working with the pdfLaTeX and LaTeX compilers on Overleaf.
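As a rough illustration of that clean-up, the sketch below shows a minimal document after switching compilers. The commented-out lines are the kind of pdfLaTeX-only packages you would remove. The sample sentence is our own and deliberately sticks to accented Latin characters, on the assumption that the default font covers them.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}

% Needed only with pdfLaTeX; remove them for XeLaTeX/LuaLaTeX:
% \usepackage[utf8]{inputenc}
% \usepackage[T1]{fontenc}

\usepackage{fontspec} % font selection for Unicode-capable engines

\begin{document}
% UTF-8 input is read directly; no input-encoding package is required.
Café, naïve, Grüße, señor.
\end{document}
```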

If your entire document involves just one language

When using the fontspec package you might get away with only setting up a main (serif) font, a sans-serif font and probably a monospaced font designed to support the language you are typesetting—there’s a catch, but we’ll revisit that later in the article. For example, if your entire document is in Greek, with some English words, you can simply write

\usepackage{fontspec}
\setmainfont[Script=Greek]{GFS Artemisia}
\setsansfont[Script=Greek]{GFS Neohellenic}
\setmonofont[Script=Greek]{Noto Mono}
. . .
Το Lorem Ipsum είναι \textsf{απλά} ένα κείμενο χωρίς νόημα
για τους επαγγελματίες της \texttt{τυπογραφίας} και στοιχειοθεσίας.

You can choose fonts from a list of available TrueType and OpenType fonts. The Ligatures=TeX option is added automatically for \setmainfont and \setsansfont, so you don’t have to add that yourself. (\setromanfont is an alias of \setmainfont.)

The LaTeX code above produces the following output:

Multiple languages/scripts in the same document: Introducing polyglossia

If your document contains non-trivial amounts of text in multiple languages, the polyglossia package helps take care of language-specific typesetting conventions and hyphenation.

\usepackage{fontspec}
\setmainfont{FreeSerif}
\setsansfont{FreeSans}
\setmonofont{FreeMono}

\usepackage{polyglossia}
\setdefaultlanguage{french}
\setotherlanguages{english,russian,thai}

\begin{document}
\begin{abstract}
Le Lorem Ipsum est simplement du faux texte employé dans 
la composition et la mise en page avant impression.
\end{abstract}

Merci. \textenglish{Thank you.} \textrussian{Спасибо.} Et plus de
texte en français!

Le Lorem Ipsum est le faux texte standard ...

\begin{english}
Lorem Ipsum is simply dummy text ...
\end{english}

\begin{russian}
Lorem Ipsum - это текст-`\textsf{рыба}', часто используемый в 
\texttt{печати} и вэб-дизайне. ...
\end{russian}

\begin{thai}
\XeTeXlinebreaklocale "th_TH"
\textenglish{Lorem Ipsum} คือ เนื้อหาจำลองแบบเรียบๆ ที่ใช้กันในธุรกิจงานพิมพ์หรืองานเรียงพิมพ์
\end{thai}
\end{document}

polyglossia lets you set the main language of the document with \setdefaultlanguage (default is English) and (possibly multiple) ‘other’ languages with \setotherlanguages. (\setmainlanguage is an alias of \setdefaultlanguage.) If you expect to be using just one other foreign language you can use the singular \setotherlanguage. The language names are the same as those used by babel.

The example above is a (primarily) French document which also contains some English, Russian and Thai text, typeset using the FreeSerif, FreeSans and FreeMono typefaces.

Because the document’s main language is french, the abstract environment automatically produces the heading ‘Résumé’. Notice how, at the end of the first paragraph, the exclamation mark is typeset using the French-spacing typesetting convention: it is set apart from ‘français’ even though it follows immediately after the word français in the source code.

In the main text, short English, Cyrillic and Thai text snippets can be included in a paragraph of French text with \textenglish{Thank you}, \textrussian{Спасибо} and \textthai{ขอบคุณ}. Generally, you can use \textLANGUAGE{...} to typeset text in any LANGUAGE that has been declared by \setdefaultlanguage and \setotherlanguages. Because the document’s main (serif) font is FreeSerif, and FreeSerif contains glyphs for Latin, Cyrillic and Thai (and more!) scripts, fontspec and polyglossia can use it to render all these texts into the output PDF.

For longer paragraphs of text in foreign/other languages, it is recommended to use \begin{LANGUAGE}...\end{LANGUAGE}, e.g. \begin{russian}...\end{russian}, \begin{thai}...\end{thai}. In the case of Arabic you can’t use \begin{arabic}...\end{arabic}; you’ll have to write \begin{Arabic}...\end{Arabic} instead, while \textarabic{...} is still valid.

Some considerations may be needed for certain languages: for instance, within the thai environment, the words Lorem Ipsum need to be wrapped in a \textenglish{...} (or \textfrench{...}) command to ensure they are rendered using the Latin-script glyphs.

At this point you might ask: If FreeSerif is so versatile and contains glyphs for Russian and Thai anyway, why would we still need to use \textrussian, \begin{english}...\end{english} etc? Wouldn’t that be redundant? Let’s see what happens when we remove the \begin{english}...\end{english} and \begin{russian}...\end{russian} environments:

Certainly, the Latin and Cyrillic glyphs are all rendered in the output PDF, but note that some words are now hyphenated incorrectly, such as ‘unk-nown’ and ‘unchan-ged’, and стандартной isn’t hyphenated at all. Without the language-switching environments, the compiler treats this text as French and applies French conventions, including French hyphenation rules which, naturally, produce incorrect results. This is why typography and typesetting are so much more than just font design and selection: they are very language- and culture-specific disciplines.

Revisiting our first Greek example, we now see why it is a good idea to load polyglossia and use \setdefaultlanguage{greek}: to ensure the document is typeset following Greek conventions.
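A minimal sketch of what the revised Greek document might look like is given here. It reuses the fonts and sample text from the first example and assumes those fonts are available to your compiler and cover the Greek script; treat it as a starting point rather than a tested template.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}

\usepackage{fontspec}
\setmainfont[Script=Greek]{GFS Artemisia}
\setsansfont[Script=Greek]{GFS Neohellenic}
\setmonofont[Script=Greek]{Noto Mono}

\usepackage{polyglossia}
\setdefaultlanguage{greek} % apply Greek hyphenation and conventions

\begin{document}
Το Lorem Ipsum είναι \textsf{απλά} ένα κείμενο χωρίς νόημα
για τους επαγγελματίες της \texttt{τυπογραφίας} και στοιχειοθεσίας.
\end{document}
```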

Mixing right-to-left (RTL) and left-to-right (LTR) languages

You need to be careful when typesetting a mixture of right-to-left (RTL) scripts, such as Arabic or Hebrew, and left-to-right (LTR) scripts in the same document. Consider the following small Arabic document with an English word, using Amiri as the main font:

\documentclass{article}
\usepackage{polyglossia}
\setdefaultlanguage{arabic}
\setmainfont{Amiri}
\begin{document}
ما هو differentiation
\end{document}

which produces:

The text is automatically set right-to-left, starting on the right-hand edge of the page. The word “differentiation” itself appears to be typeset correctly as left-to-right text. But wait, no it’s not! It’s rendered as “dffierentiation” in the output! What’s going on?

The Amiri font does have glyphs for the Latin alphabet, but here the text differentiation is not marked as English: the compiler treats differentiation as right-to-left text, as if it were a sequence of Arabic characters. During typesetting, the original sequence iff is processed as ffi (i.e., as RTL text) and Amiri’s ligature glyph for “ffi” is typeset. Marking the word with \textenglish{...} ensures it is interpreted correctly as left-to-right text.

\setmainfont{Amiri}
\setotherlanguage{english}
\newfontfamily\englishfont{TeX Gyre Termes}
\begin{document}
ما هو \textenglish{differentiation}
\end{document}

Note: If you’re used to the babel package commands you’ll be happy to hear that the commands \selectlanguage, \foreignlanguage and the environment otherlanguage are also supported by polyglossia.
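For readers coming from babel, the following sketch shows how those three constructs might be used in a polyglossia document. The font and languages are borrowed from the earlier French example; the English sentences are placeholders of our own.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}

\usepackage{fontspec}
\setmainfont{FreeSerif}

\usepackage{polyglossia}
\setdefaultlanguage{french}
\setotherlanguage{english}

\begin{document}
% Short inline switch, comparable to \textenglish{...}
Merci. \foreignlanguage{english}{Thank you.}

% Block-level switch, comparable to \begin{english}...\end{english}
\begin{otherlanguage}{english}
Lorem Ipsum is simply dummy text ...
\end{otherlanguage}

% Switch language for the rest of the current group (here, the document)
\selectlanguage{english}
From here on, English conventions apply.
\end{document}
```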

Language-specific options

Some languages support additional options for customisation; for example, greek accepts a variant=ancient, mono or poly option for ancient, monotonic or polytonic Greek; hindi can be configured with numerals=western or devanagari. See the polyglossia package documentation for details.

These can be specified when loading the language:

\setdefaultlanguage[variant=poly]{greek}
\setotherlanguage[numerals=western]{hindi}

or later at any time:

\setkeys{greek}{variant=ancient}

or even locally for a specific environment:

\begin{greek}[variant=ancient]
...
\end{greek}

Specifying fonts for specific languages

You can specify the font used for different languages. Suppose you’d like to typeset all English text (contained in our previous example) in italics; you could write:

\newfontfamily\englishfont{FreeSerif Italic}

You can of course use something even more flamboyant:

\newfontfamily\englishfont{Chancery Uralic}

This mechanism of setting fonts for different languages or scripts is especially important when you use a main font that does not have glyphs for all scripts or languages in your document. Suppose we now decide to use Caladea as the main document font:

\setmainfont{Caladea}

Upon compilation we would see the following error:

Package polyglossia Error: The current roman font
does not contain the Cyrillic script!

(polyglossia)                Please define
\cyrillicfont with \newfontfamily.

See the polyglossia package documentation for
explanation.
Type  H <return>  for immediate help.
 ...

l.15 \select@language {russian}

Package polyglossia Error: The current roman font
does not contain the Thai script!

(polyglossia)                Please define
\thaifont with \newfontfamily.

See the polyglossia package documentation for
explanation.
Type  H <return>  for immediate help.
 ...

l.23 \select@language {thai}
...

We are now obligated to specify which fonts to use for Cyrillic and Thai scripts. Again, you can refer to the list of available TrueType and OpenType fonts on Overleaf.

\newfontfamily\cyrillicfont[Script=Cyrillic]{Charis SIL}
\newfontfamily\thaifont[Script=Thai]{Garuda}

Note: it is outside the scope of this article to address issues relating to choices of aesthetically-pleasing and typographically-compatible font combinations.

Notice that we’ve defined \cyrillicfont instead of \russianfont, i.e. we defined a font for the Cyrillic script rather than the Russian language. The advantage of defining \cyrillicfont is that if, for example, serbian is also a defined language in your project, then \textserbian would automatically use the defined \cyrillicfont. If you had defined only \russianfont, then using \textserbian would again complain about “the current roman font does not contain the Cyrillic script” and you would need to define \cyrillicfont anyway — unless you did mean to use a different font for Serbian text!
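To make this concrete, here is a small sketch in which a single \cyrillicfont definition serves both Russian and Serbian. It assumes that your installed polyglossia version typesets serbian in the Cyrillic script by default (check the package documentation for the script option if it does not), and the Serbian sample word is our own addition.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}

\usepackage{fontspec}
\setmainfont{Caladea} % no Cyrillic glyphs in this font

\usepackage{polyglossia}
\setdefaultlanguage{french}
\setotherlanguages{russian,serbian}

% One script-level definition, shared by every language using Cyrillic
\newfontfamily\cyrillicfont[Script=Cyrillic]{Charis SIL}

\begin{document}
Merci. \textrussian{Спасибо.} \textserbian{Хвала.}
\end{document}
```

If you wanted Serbian text in a different typeface, you could instead define a language-level font for it in addition to \cyrillicfont.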

Another similar scenario is the Devanagari script, which is used for the Hindi and Sanskrit languages; or the Arabic script used for Arabic and Farsi (Persian).

\setdefaultlanguage{english}
\setotherlanguages{hindi,sanskrit}
\newfontfamily\devanagarifont[Script=Devanagari]{Lohit Devanagari}
...
Hindi: \texthindi{हिन्दी}
Sanskrit: \textsanskrit{संस्कृतम्}

When using \newfontfamily it is necessary to specify the Script option, otherwise some glyphs may be rendered incorrectly. For example, if we had written only \newfontfamily\thaifont{Garuda}, the typeset result may be wrong, whereas adding [Script=Thai] produces the correct output.

Defining other font families

Let’s have a look at another example, this time with Hebrew:

\documentclass{article}
\usepackage{polyglossia}
\setdefaultlanguage[numerals=hebrew]{hebrew}
\setotherlanguage{english}
\newfontfamily\hebrewfont[Script=Hebrew]{Hadasim CLM}
\begin{document}
\section{מבוא}
זוהי עובדה מבוססת שדעתו של הקורא תהיה מוסחת עלידי טקטס קריא כאשר הוא יביט בפריסתו.  -
\end{document}

So far so good. Now suppose we were using a template originally created for an English document, which sets section headers in sans serif type using the titlesec package:

\RequirePackage{titlesec}
\titleformat{\section}{\Large\sffamily\bfseries}{\thesection}{1em}{}
\usepackage{polyglossia}
\setdefaultlanguage[numerals=hebrew]{hebrew}
...

We are confronted with the error message:

Package polyglossia Error: The current roman font
does not contain the Hebrew script!

(polyglossia)                Please define
\hebrewfont with \newfontfamily.

See the polyglossia package documentation for
explanation.
Type  H <return>  for immediate help.
 ...

l.27 \section{מבוא}

This is a bit confusing: didn’t we already define \hebrewfont to be Hadasim CLM? Well, it’s really because we haven’t specified a sans serif font for Hebrew. Let’s remedy this by adding a definition for \hebrewfontsf:

\newfontfamily\hebrewfontsf[Script=Hebrew]{Miriam CLM}

And now we have the output:

Should the need arise, we could also define a monospaced font to use with \hebrewfonttt.
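As a sketch of how the three Hebrew font families could sit together, the example below adds a \hebrewfonttt definition to the preamble used above. The font name Miriam Mono CLM is an assumption on our part: substitute any monospaced font with Hebrew glyphs that is available to your compiler. The sample body text is also our own.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}

\usepackage{titlesec}
\titleformat{\section}{\Large\sffamily\bfseries}{\thesection}{1em}{}

\usepackage{polyglossia}
\setdefaultlanguage[numerals=hebrew]{hebrew}

\newfontfamily\hebrewfont[Script=Hebrew]{Hadasim CLM}   % serif (main) text
\newfontfamily\hebrewfontsf[Script=Hebrew]{Miriam CLM}  % sans serif, used by the section headings
\newfontfamily\hebrewfonttt[Script=Hebrew]{Miriam Mono CLM} % monospaced; assumed font name

\begin{document}
\section{מבוא}
שלום \texttt{שלום}
\end{document}
```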

Acknowledgements

All lorem ipsum snippets, in various languages, are from https://lipsum.com.
