This page looks best with JavaScript enabled

不容忽视的 Goroutine Leak 以及如何排查

 ·  ☕ 4 min read

我们可以用 golang 很轻松地进行并发编程,但是在并发编程过程中,得留意 goroutine 的泄露。

什么是 goroutine 泄露

其实 goroutine 泄露就是内存泄露的一种:你开启了一个 goroutine ,但是你开启完了就把它忘了或者不管它了,它以后是死是活,忙不忙,是不是阻塞了,你都就不关心它!(渣男/渣女行为)。然后这个 goroutine 要是一直活着就会占用你的内存资源,虽然一个 goroutine 占用的内存资源很少,但是要是成千上万的 goroutine 泄露了,后果还是很严重的。这就是 goroutine 泄露。

例子

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
func doJobs() {
	result := make(chan int)
	for jobId := 0; jobId < 10; jobId++ {
		go hardJob(jobId, result)
	}

	ctx, cancelFunc := context.WithTimeout(context.Background(), time.Second*3)
	defer cancelFunc()

loop:
	for {
		select {
		case _ = <-result:
			log.Println("a job has done")
		case <-ctx.Done():
			ctx.Deadline()
			log.Printf("got context err: %s\n", ctx.Err())
			break loop
		}
	}

// going on
	select {}
}

func hardJob(jobId int, ch chan<- int) {
	log.Printf("a worker is doning hard job:[%d]\n", jobId)
	time.Sleep(time.Second * time.Duration(rand.Int31()%10))
	ch <- jobId
}

这段程序的输出是这样的

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
> go test -v  . -count=1 -run TestLeakTest
=== RUN   TestLeakTest
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[2]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[9]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[3]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[4]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[5]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[6]
2021/10/25 21:43:49 leak_test.go:34: a job has done
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[7]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[8]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[0]
2021/10/25 21:43:49 leak_test.go:46: a worker is doning hard job:[1]
2021/10/25 21:43:50 leak_test.go:34: a job has done
2021/10/25 21:43:51 leak_test.go:34: a job has done
2021/10/25 21:43:51 leak_test.go:34: a job has done
2021/10/25 21:43:52 leak_test.go:34: a job has done
2021/10/25 21:43:52 leak_test.go:37: got context err: context deadline exceeded

可以看到我们明显安排了在第 4 行,明明安排了 10 个工人去干活,但是之后 4 个工人报告自己干完了,别的工人永远阻塞在第 29 行,因为result 这个channel没有buffer, 必须得另一边有人取走的同时, worker才能将值传进channel, 这是goroutine 泄露的一种常见场景。

解决办法

  1. channel 设置buffer, goroutine为你干完了活,怎么能让工人不把工作成果交给你呢?channel初始化时设置buffer可以让工人干完活就退出,从而释放资源。
1
result := make(chan int,10)
  1. 要是 ctx.Context 超时了,得通知工人,比如我可以把hardJob函数这样改:
1
2
3
4
5
6
7
8
9
func hardJobWithCtx(ctx context.Context, jobId int, ch chan<- int) {
	log.Printf("a worker is doning hard job:[%d]\n", jobId)
	time.Sleep(time.Second * time.Duration(rand.Int31()%10))
	
	select {
	case <-ctx.Done():
	case ch <- jobId:
	}
}

hardJObhardJobWithContext的唯一区别就是签名里包含了ctx.Context,并且用上select处理两个channel, 要是超时了,goroutine就会从 case <- ctx.Done() 这里返回,从而释放资源。

怎么检查 goroutine 泄露呢?

etcd 的做法

我在etcd的代码库里找到这样一段有意思的代码:
/home/vory/go/pkg/mod/go.etcd.io/etcd/client/pkg/[email protected]/testutil/leak_test.go

1
2
3
4
5
6
7
8
9
func TestMain(m *testing.M) {
	m.Run()
	isLeaked := CheckLeakedGoroutine()
	if ranSample && !isLeaked {
		fmt.Fprintln(os.Stderr, "expected leaky goroutines but none is detected")
		os.Exit(1)
	}
	os.Exit(0)
}

etcdTestMain里这样检查goroutine泄露。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
func CheckLeakedGoroutine() bool {
	gs := interestingGoroutines()
	if len(gs) == 0 {
		return false
	}

	stackCount := make(map[string]int)
	re := regexp.MustCompile(`\(0[0-9a-fx, ]*\)`)
	for _, g := range gs {
		// strip out pointer arguments in first function of stack dump
		normalized := string(re.ReplaceAll([]byte(g), []byte("(...)")))
		stackCount[normalized]++
	}

	fmt.Fprintf(os.Stderr, "Unexpected goroutines running after all test(s).\n")
	for stack, count := range stackCount {
		fmt.Fprintf(os.Stderr, "%d instances of:\n%s\n", count, stack)
	}
	return true
}

它用 interestingGoroutines()函数来从runtime.Stack(buffer, true)得到所有goroutinestack trace, 然后从中过滤掉自身原有的(比如文件描述符的关闭,testing.Main 本身, 网络描述符的关闭)等等一些正常的 goroutine, 从而得到异常的的goroutine

Goroutine Leak Detector:uber-go/goleak

https://github.com/uber-go/goleak
uber 有现成的库可以检测goroutine泄露:用法也很简单:

1
2
3
4
5
func TestDoJobs(t *testing.T) {
	defer goleak.VerifyNone(t)
	
	doJobs()
}

这个测试单元执行完之后,会得到:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
    leaks.go:78: found unexpected goroutines:
        [Goroutine 7 in state sleep, with time.Sleep on top of the stack:
        goroutine 7 [sleep]:
        time.Sleep(0x1a13b8600)
                /usr/lib/go/src/runtime/time.go:193 +0xd2
        github.com/eval-exec/golang/leak.hardJob(0x1, 0xc00001e360)
                /home/vory/github.com/eval-exec/golang/leak/leak_test.go:42 +0xd6
        created by github.com/eval-exec/golang/leak.doJobs
                /home/vory/github.com/eval-exec/golang/leak/leak_test.go:21 +0x91

         Goroutine 8 in state sleep, with time.Sleep on top of the stack:
        goroutine 8 [sleep]:
        time.Sleep(0x12a05f200)
                /usr/lib/go/src/runtime/time.go:193 +0xd2
        github.com/eval-exec/golang/leak.hardJob(0x2, 0xc00001e360)
                /home/vory/github.com/eval-exec/golang/leak/leak_test.go:42 +0xd6
        created by github.com/eval-exec/golang/leak.doJobs
                /home/vory/github.com/eval-exec/golang/leak/leak_test.go:21 +0x91

我研究了一下goleak的实现,就是和etcd用的方法一样,也是从runtime.Stack()的结果筛选出泄露的goroutine

排查正在运行的进程

可以用"net/http/pprof", curl一下/pprof/goroutine, 获得当前正在运行的goroutine 总数和它们的stack trace

牢记

Never start a goroutine without knowing how it will stop

Reference

  1. https://www.ardanlabs.com/blog/2018/11/goroutine-leaks-the-forgotten-sender.html
  2. https://www.ardanlabs.com/blog/2018/12/goroutine-leaks-the-abandoned-receivers.html
  3. https://github.com/uber-go/goleak
  4. https://github.com/dotmesh-io/dotmesh/blob/92aee93c95f8efe2e526e885d98ba93a456d34ed/cmd/dotmesh-server/pkg/main/rpc.go#L1795
  5. https://dotmesh.com/blog/finding-goroutine-leaks/
Share on

EXEC
WRITTEN BY
EXEC
Evil EXEC