Skip to main content

2 posts tagged with "String"

View All Tags

· 3 min read
wen

Problem

It's a common problem to count the number of (visible) characters in a string.

In Golang, we can use utf8.RuneCountInString() function to count the number of characters in a string.

It works well for most cases, including multi-byte characters like Chinese.

Code example

package main

import (
"fmt"
"unicode/utf8"
)

func main() {
str := "abc世界"
fmt.Println(utf8.RuneCountInString(str)) // 5
}

But if the string contains emoji characters like 👉🏻, then in some cases this function will not calculate correctly.

package main

import (
"fmt"
"unicode/utf8"
)

func main() {
emojiWorld := "🌍"
fmt.Println(utf8.RuneCountInString(emojiWorld)) // 1 ✅ no problem

emojiHand := "👉"
fmt.Println(utf8.RuneCountInString(emojiHand)) // 1 ✅ no problem

emojiHandBlack := "👉🏿"
fmt.Println(utf8.RuneCountInString(emojiHandBlack)) // 2 ❌ Got 2, expected 1.

emojiOne := "1️⃣"
fmt.Println(utf8.RuneCountInString(emojiOne)) // 3 ❌ Got 3, expected 1.
}

You can copy the emoji from here emojipedia.

Why

This is because some emoji are composed of multiple unicode characters (Code Points), while the utf8.RuneCountInString() function only counts the number of unicode characters.

TermDescription
BytesThe smallest unit used to measure data storage, typically 8 bits in binary.
Code UnitsFixed-size units used in encoding schemes to represent a character. In UTF-8, a Code Unit is 8 bits, and in UTF-16, it's 16 bits.
Code PointsIn the Unicode standard, each character is assigned a unique code point, which is a numerical identifier for the character. For example, the code point for the Latin letter "A" is U+0041.
Grapheme ClustersRepresent the smallest units perceivable in a language, typically a sequence of one or more Code Points. For instance, a letter with an accent mark might be a Grapheme Cluster.

For example, 1️⃣ this emoji is composed of 3 Code Points, which are:

img

Rendered by Markdown Table

tip

Emoji Code Points can be found here emojipedia.

Solution

So we need to count the number of Grapheme Clusters instead of Code Points.

Use third-party library rivo/uniseg

package main

import (
"fmt"

"github.com/rivo/uniseg"
)

func main() {
emojiWorld := "🌍"
fmt.Println(uniseg.GraphemeClusterCount(emojiWorld)) // 1 ✅ 没有问题

emojiHand := "👉"
fmt.Println(uniseg.GraphemeClusterCount(emojiHand)) // 1 ✅ 没有问题

emojiHandBlack := "👉🏿"
fmt.Println(uniseg.GraphemeClusterCount(emojiHandBlack)) // 1 ✅ 没有问题

emojiOne := "1️⃣"
fmt.Println(uniseg.GraphemeClusterCount(emojiOne)) // 1 ✅ 没有问题
}
tip

Actually, there are not only emoji, but also some Thai and Arabic characters are composed of multiple unicode characters.

Reference

Go で文字数をカウントする 在 Go 中计算字符数

文字数をカウントする 7 つの方法

Go: Unicode と rune 型

· One min read
wen

背景

labstack/gommon 的代码中 看到了这个库 valyala/fasttemplate, 于是就去 调查了一下。

tip

labstack 是 Echo Web Framework 的 Organization。

什么是 fasttemplate

fasttemplate 是一个高效的 Go 模板引擎,比 Go 标准库的模板引擎 text/template 快很多,

而且比 strings.Replace, strings.Replacerfmt.Fprintf 都要快。

img

具体可以看一下 fasttemplate 的 benchmark

fasttemplate 的使用

基础用法

    template := "http://{{host}}/?q={{query}}&foo={{bar}}{{bar}}"
t := fasttemplate.New(template, "{{", "}}")
s := t.ExecuteString(map[string]interface{}{
"host": "google.com",
"query": url.QueryEscape("hello=world"),
"bar": "foobar",
})
fmt.Printf("%s", s)

// Output:
// http://google.com/?q=hello%3Dworld&foo=foobarfoobar

高阶用法

    template := "Hello, [user]! You won [prize]!!! [foobar]"
t, err := fasttemplate.NewTemplate(template, "[", "]")
if err != nil {
log.Fatalf("unexpected error when parsing template: %s", err)
}
s := t.ExecuteFuncString(func(w io.Writer, tag string) (int, error) {
switch tag {
case "user":
return w.Write([]byte("John"))
case "prize":
return w.Write([]byte("$100500"))
default:
return w.Write([]byte(fmt.Sprintf("[unknown tag %q]", tag)))
}
})
fmt.Printf("%s", s)

// Output:
// Hello, John! You won $100500!!! [unknown tag "foobar"]