我们可以用 golang
很轻松地进行并发编程,但是在并发编程过程中,得留意 goroutine
的泄露。
什么是 goroutine 泄露
其实 goroutine
泄露就是内存泄露的一种:你开启了一个 goroutine
,但是你开启完了就把它忘了或者不管它了,它以后是死是活,忙不忙,是不是阻塞了,你都就不关心它!(渣男/渣女行为)。然后这个 goroutine
要是一直活着就会占用你的内存资源,虽然一个 goroutine
占用的内存资源很少,但是要是成千上万的 goroutine
泄露了,后果还是很严重的。这就是 goroutine
泄露。
例子
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
|
func doJobs() {
result := make(chan int)
for jobId := 0; jobId < 10; jobId++ {
go hardJob(jobId, result)
}
ctx, cancelFunc := context.WithTimeout(context.Background(), time.Second*3)
defer cancelFunc()
loop:
for {
select {
case _ = <-result:
log.Println("a job has done")
case <-ctx.Done():
ctx.Deadline()
log.Printf("got context err: %s\n", ctx.Err())
break loop
}
}
// going on
select {}
}
func hardJob(jobId int, ch chan<- int) {
log.Printf("a worker is doning hard job:[%d]\n", jobId)
time.Sleep(time.Second * time.Duration(rand.Int31()%10))
ch <- jobId
}
|
这段程序的输出是这样的
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
> go test -v . -count=1 -run TestLeakTest
=== RUN TestLeakTest
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[2]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[9]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[3]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[4]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[5]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[6]
2021/10/25 21:43:49 leak_test.go:34: a job has done
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[7]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[8]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[0]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[1]
2021/10/25 21:43:50 leak_test.go:34: a job has done
2021/10/25 21:43:51 leak_test.go:34: a job has done
2021/10/25 21:43:51 leak_test.go:34: a job has done
2021/10/25 21:43:52 leak_test.go:34: a job has done
2021/10/25 21:43:52 leak_test.go:37: got context err: context deadline exceeded
|
可以看到我们明显安排了在第 4 行,明明安排了 10 个工人去干活,但是之后 4 个工人报告自己干完了,别的工人永远阻塞在第 29 行,因为result
这个channel
没有buffer
, 必须得另一边有人取走的同时, worker
才能将值传进channel
, 这是goroutine
泄露的一种常见场景。
解决办法
- 为
channel
设置buffer
, goroutine
为你干完了活,怎么能让工人不把工作成果交给你呢?channel
初始化时设置buffer
可以让工人干完活就退出,从而释放资源。
1
|
result := make(chan int,10)
|
- 要是
ctx.Context
超时了,得通知工人,比如我可以把hardJob
函数这样改:
1
2
3
4
5
6
7
8
9
|
func hardJobWithCtx(ctx context.Context, jobId int, ch chan<- int) {
log.Printf("a worker is doning hard job:[%d]\n", jobId)
time.Sleep(time.Second * time.Duration(rand.Int31()%10))
select {
case <-ctx.Done():
case ch <- jobId:
}
}
|
hardJOb
和hardJobWithContext
的唯一区别就是签名里包含了ctx.Context
,并且用上select
处理两个channel
, 要是超时了,goroutine
就会从 case <- ctx.Done()
这里返回,从而释放资源。
怎么检查 goroutine 泄露呢?
etcd 的做法
我在etcd
的代码库里找到这样一段有意思的代码:
/home/vory/go/pkg/mod/go.etcd.io/etcd/client/pkg/[email protected]/testutil/leak_test.go
1
2
3
4
5
6
7
8
9
|
func TestMain(m *testing.M) {
m.Run()
isLeaked := CheckLeakedGoroutine()
if ranSample && !isLeaked {
fmt.Fprintln(os.Stderr, "expected leaky goroutines but none is detected")
os.Exit(1)
}
os.Exit(0)
}
|
etcd
在TestMain
里这样检查goroutine
泄露。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
func CheckLeakedGoroutine() bool {
gs := interestingGoroutines()
if len(gs) == 0 {
return false
}
stackCount := make(map[string]int)
re := regexp.MustCompile(`\(0[0-9a-fx, ]*\)`)
for _, g := range gs {
// strip out pointer arguments in first function of stack dump
normalized := string(re.ReplaceAll([]byte(g), []byte("(...)")))
stackCount[normalized]++
}
fmt.Fprintf(os.Stderr, "Unexpected goroutines running after all test(s).\n")
for stack, count := range stackCount {
fmt.Fprintf(os.Stderr, "%d instances of:\n%s\n", count, stack)
}
return true
}
|
它用 interestingGoroutines()
函数来从runtime.Stack(buffer, true)
得到所有goroutine
的stack trace
, 然后从中过滤掉自身原有的(比如文件描述符的关闭,testing.Main 本身, 网络描述符的关闭)等等一些正常的 goroutine, 从而得到异常的的goroutine
。
Goroutine Leak Detector:uber-go/goleak
https://github.com/uber-go/goleak
uber 有现成的库可以检测goroutine
泄露:用法也很简单:
1
2
3
4
5
|
func TestDoJobs(t *testing.T) {
defer goleak.VerifyNone(t)
doJobs()
}
|
这个测试单元执行完之后,会得到:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
leaks.go:78: found unexpected goroutines:
[Goroutine 7 in state sleep, with time.Sleep on top of the stack:
goroutine 7 [sleep]:
time.Sleep(0x1a13b8600)
/usr/lib/go/src/runtime/time.go:193 +0xd2
github.com/eval-exec/golang/leak.hardJob(0x1, 0xc00001e360)
/home/vory/github.com/eval-exec/golang/leak/leak_test.go:42 +0xd6
created by github.com/eval-exec/golang/leak.doJobs
/home/vory/github.com/eval-exec/golang/leak/leak_test.go:21 +0x91
Goroutine 8 in state sleep, with time.Sleep on top of the stack:
goroutine 8 [sleep]:
time.Sleep(0x12a05f200)
/usr/lib/go/src/runtime/time.go:193 +0xd2
github.com/eval-exec/golang/leak.hardJob(0x2, 0xc00001e360)
/home/vory/github.com/eval-exec/golang/leak/leak_test.go:42 +0xd6
created by github.com/eval-exec/golang/leak.doJobs
/home/vory/github.com/eval-exec/golang/leak/leak_test.go:21 +0x91
|
我研究了一下goleak
的实现,就是和etcd
用的方法一样,也是从runtime.Stack()
的结果筛选出泄露的goroutine
。
排查正在运行的进程
可以用"net/http/pprof", curl
一下/pprof/goroutine
, 获得当前正在运行的goroutine
总数和它们的stack trace
牢记
Never start a goroutine without knowing how it will stop
Reference
- https://www.ardanlabs.com/blog/2018/11/goroutine-leaks-the-forgotten-sender.html
- https://www.ardanlabs.com/blog/2018/12/goroutine-leaks-the-abandoned-receivers.html
- https://github.com/uber-go/goleak
- https://github.com/dotmesh-io/dotmesh/blob/92aee93c95f8efe2e526e885d98ba93a456d34ed/cmd/dotmesh-server/pkg/main/rpc.go#L1795
- https://dotmesh.com/blog/finding-goroutine-leaks/