Problem
It's a common problem to count the number of (visible) characters in a string.
In Golang, we can use utf8.RuneCountInString()
function to count the number of characters in a string.
It works well for most cases, including multi-byte characters like Chinese.
Code example
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
str := "abc世界"
fmt.Println(utf8.RuneCountInString(str)) // 5
}
But if the string contains emoji characters like 👉🏻, then in some cases this function will not calculate correctly.
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
emojiWorld := "🌍"
fmt.Println(utf8.RuneCountInString(emojiWorld)) // 1 ✅ no problem
emojiHand := "👉"
fmt.Println(utf8.RuneCountInString(emojiHand)) // 1 ✅ no problem
emojiHandBlack := "👉🏿"
fmt.Println(utf8.RuneCountInString(emojiHandBlack)) // 2 ❌ Got 2, expected 1.
emojiOne := "1️⃣"
fmt.Println(utf8.RuneCountInString(emojiOne)) // 3 ❌ Got 3, expected 1.
}
You can copy the emoji from here emojipedia.
Why
This is because some emoji are composed of multiple unicode characters (Code Points), while the utf8.RuneCountInString()
function only counts the number of unicode characters.
Term | Description |
---|---|
Bytes | The smallest unit used to measure data storage, typically 8 bits in binary. |
Code Units | Fixed-size units used in encoding schemes to represent a character. In UTF-8, a Code Unit is 8 bits, and in UTF-16, it's 16 bits. |
Code Points | In the Unicode standard, each character is assigned a unique code point, which is a numerical identifier for the character. For example, the code point for the Latin letter "A" is U+0041. |
Grapheme Clusters | Represent the smallest units perceivable in a language, typically a sequence of one or more Code Points. For instance, a letter with an accent mark might be a Grapheme Cluster. |
For example, 1️⃣ this emoji is composed of 3 Code Points, which are:
Rendered by Markdown Table
Emoji Code Points can be found here emojipedia.
Solution
So we need to count the number of Grapheme Clusters instead of Code Points.
Use third-party library rivo/uniseg
package main
import (
"fmt"
"github.com/rivo/uniseg"
)
func main() {
emojiWorld := "🌍"
fmt.Println(uniseg.GraphemeClusterCount(emojiWorld)) // 1 ✅ 没有问题
emojiHand := "👉"
fmt.Println(uniseg.GraphemeClusterCount(emojiHand)) // 1 ✅ 没有问题
emojiHandBlack := "👉🏿"
fmt.Println(uniseg.GraphemeClusterCount(emojiHandBlack)) // 1 ✅ 没有问题
emojiOne := "1️⃣"
fmt.Println(uniseg.GraphemeClusterCount(emojiOne)) // 1 ✅ 没有问题
}
Actually, there are not only emoji, but also some Thai and Arabic characters are composed of multiple unicode characters.