Skip to main content

One post tagged with "emoji"

View All Tags

· 3 min read
wen

Problem

It's a common problem to count the number of (visible) characters in a string.

In Golang, we can use utf8.RuneCountInString() function to count the number of characters in a string.

It works well for most cases, including multi-byte characters like Chinese.

Code example

package main

import (
"fmt"
"unicode/utf8"
)

func main() {
str := "abc世界"
fmt.Println(utf8.RuneCountInString(str)) // 5
}

But if the string contains emoji characters like 👉🏻, then in some cases this function will not calculate correctly.

package main

import (
"fmt"
"unicode/utf8"
)

func main() {
emojiWorld := "🌍"
fmt.Println(utf8.RuneCountInString(emojiWorld)) // 1 ✅ no problem

emojiHand := "👉"
fmt.Println(utf8.RuneCountInString(emojiHand)) // 1 ✅ no problem

emojiHandBlack := "👉🏿"
fmt.Println(utf8.RuneCountInString(emojiHandBlack)) // 2 ❌ Got 2, expected 1.

emojiOne := "1️⃣"
fmt.Println(utf8.RuneCountInString(emojiOne)) // 3 ❌ Got 3, expected 1.
}

You can copy the emoji from here emojipedia.

Why

This is because some emoji are composed of multiple unicode characters (Code Points), while the utf8.RuneCountInString() function only counts the number of unicode characters.

TermDescription
BytesThe smallest unit used to measure data storage, typically 8 bits in binary.
Code UnitsFixed-size units used in encoding schemes to represent a character. In UTF-8, a Code Unit is 8 bits, and in UTF-16, it's 16 bits.
Code PointsIn the Unicode standard, each character is assigned a unique code point, which is a numerical identifier for the character. For example, the code point for the Latin letter "A" is U+0041.
Grapheme ClustersRepresent the smallest units perceivable in a language, typically a sequence of one or more Code Points. For instance, a letter with an accent mark might be a Grapheme Cluster.

For example, 1️⃣ this emoji is composed of 3 Code Points, which are:

img

Rendered by Markdown Table

tip

Emoji Code Points can be found here emojipedia.

Solution

So we need to count the number of Grapheme Clusters instead of Code Points.

Use third-party library rivo/uniseg

package main

import (
"fmt"

"github.com/rivo/uniseg"
)

func main() {
emojiWorld := "🌍"
fmt.Println(uniseg.GraphemeClusterCount(emojiWorld)) // 1 ✅ 没有问题

emojiHand := "👉"
fmt.Println(uniseg.GraphemeClusterCount(emojiHand)) // 1 ✅ 没有问题

emojiHandBlack := "👉🏿"
fmt.Println(uniseg.GraphemeClusterCount(emojiHandBlack)) // 1 ✅ 没有问题

emojiOne := "1️⃣"
fmt.Println(uniseg.GraphemeClusterCount(emojiOne)) // 1 ✅ 没有问题
}
tip

Actually, there are not only emoji, but also some Thai and Arabic characters are composed of multiple unicode characters.

Reference

Go で文字数をカウントする 在 Go 中计算字符数

文字数をカウントする 7 つの方法

Go: Unicode と rune 型