How to count number of characters in string with emoji in Golang

January 14, 2024 · 3 min read

wen

Maintainer of thewang

Problem

It's a common problem to count the number of (visible) characters in a string.

In Golang, we can use utf8.RuneCountInString() function to count the number of characters in a string.

It works well for most cases, including multi-byte characters like Chinese.

Code example

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "abc世界"
    fmt.Println(utf8.RuneCountInString(str)) // 5
}

But if the string contains emoji characters like 👉🏻, then in some cases this function will not calculate correctly.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    emojiWorld := "🌍"
    fmt.Println(utf8.RuneCountInString(emojiWorld)) // 1  ✅ no problem

    emojiHand := "👉"
    fmt.Println(utf8.RuneCountInString(emojiHand)) // 1  ✅ no problem

    emojiHandBlack := "👉🏿"
    fmt.Println(utf8.RuneCountInString(emojiHandBlack)) // 2 ❌ Got 2, expected 1.

    emojiOne := "1️⃣"
    fmt.Println(utf8.RuneCountInString(emojiOne)) // 3 ❌ Got 3, expected 1.
}

You can copy the emoji from here emojipedia.

Why

This is because some emoji are composed of multiple unicode characters (Code Points), while the utf8.RuneCountInString() function only counts the number of unicode characters.

Term	Description
Bytes	The smallest unit used to measure data storage, typically 8 bits in binary.
Code Units	Fixed-size units used in encoding schemes to represent a character. In UTF-8, a Code Unit is 8 bits, and in UTF-16, it's 16 bits.
Code Points	In the Unicode standard, each character is assigned a unique code point, which is a numerical identifier for the character. For example, the code point for the Latin letter "A" is U+0041.
Grapheme Clusters	Represent the smallest units perceivable in a language, typically a sequence of one or more Code Points. For instance, a letter with an accent mark might be a Grapheme Cluster.

For example, 1️⃣ this emoji is composed of 3 Code Points, which are:

Rendered by Markdown Table

tip

Emoji Code Points can be found here emojipedia.

Solution

So we need to count the number of Grapheme Clusters instead of Code Points.

Use third-party library rivo/uniseg

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    emojiWorld := "🌍"
    fmt.Println(uniseg.GraphemeClusterCount(emojiWorld)) // 1  ✅ 没有问题

    emojiHand := "👉"
    fmt.Println(uniseg.GraphemeClusterCount(emojiHand)) // 1  ✅ 没有问题

    emojiHandBlack := "👉🏿"
    fmt.Println(uniseg.GraphemeClusterCount(emojiHandBlack)) // 1  ✅ 没有问题

    emojiOne := "1️⃣"
    fmt.Println(uniseg.GraphemeClusterCount(emojiOne)) // 1  ✅ 没有问题
}

tip

Actually, there are not only emoji, but also some Thai and Arabic characters are composed of multiple unicode characters.

Reference

Go で文字数をカウントする在 Go 中计算字符数

文字数をカウントする 7 つの方法

Go: Unicode と rune 型

Problem​

Code example​

Why​

Solution​

Use third-party library rivo/uniseg​

Reference​

Problem

Code example

Why

Solution

Use third-party library rivo/uniseg

Reference