Go性能分析工具

Go语言作为一个“新兴语言”，它自身为我们提供了一套用于性能测试和分析的工具和接口，这里咱们结合一个具体的实例来讲解一下简单的调优过程。

一、常用的测试命令

关于go test命令的详细讲解，我们在测试工具一节有过总结。这里不再赘述，总结几个常用的命令场景：

  //启动指定的测试函数
  go test -run=xxx $packetpath

  //编译一个测试binary文件
  go test -c -o $bainaryName

  //启动一个bench测试，不运行功能测试函数，并显示测试过程中分配mem的数量
  go test -bech=xxx -run=None -benchmem

  //启动profile分析
  go test -bench=xxx -run=None -cpuprofile=cpu.out    
  go test -bench=xxx -run=None -blockprofile=block.out    
  go test -bench=xxx -run=None -memprofile=mem.out

  //执行测试并生成coverprofile 文件
  go test -run=TransferProofGenerator -cover -coverprofile=c.out

  //查看覆盖率，通过go tool 工具
  go tool cover -html=c.out

无论是cpu分析、阻塞分析、堆分析，主要目的都是为了消耗更少的计算资源、内存资源来取得更高的效率，但有时候时间和空间并不能兼得，用空间换时间是很常用的一种做法，需要根据具体情况进行权衡。

二、性能剖析流程

性能调优首要的任务就是找出消耗时间最多的函数。一般步骤如下：

收集程序的profile
查看profile数据

收集程序的profile

一般，收集profile的方法有两种：一种是我们在测试工具节中介绍的命令行方式。另一种就是通过调用相关模块来收集，我们以CPUProfile为例：

  import( 
    "runtime/pprof"
    "testing"
    "os"
  )

  func BenchmarkXXX(b *testing.B){
    fd, _ := os.OpenFile("cpu.log", os.O_CREATE| os.O_RDWR, os.ModeSetuid)    
    if err := pprof.StartCPUProfile(fd); err != nil{
      b.Fatalf("TestTransferProofValidator %s", err.Error())
    }    

    defer func() {
      pprof.StopCPUProfile()
      fd.Close()
    }()

  }

这样在执行后自动就会生成cpuprofile文件，这里是将函数放在里Bench测试函数中，也可以将它放在正常函数中比如main函数。有的时候需要实时的监听web服务的运行状态，这个时候就可以通过引入net/http/pprof来收集profile：

  package main

  import (
    "net/http"
    "github.com/utxo/webserver/router"
    _ "net/http/pprof"
  )

  func main()  {
    handler:= http.FileServer(http.Dir("./template"));
    http.Handle("/", handler)
    http.HandleFunc("/transfer", router.Transfer);
    http.ListenAndServe("0.0.0.0:8082", nil);
  }

此时会默认开启proof收集服务，go tool作为客户端来访问该router来收集profile:

 go tool pprof  http://localhost:8082/debug/pprof/profile

在收集结束后会进入cmd命令界面（默认进行 30s 的 CPU Profiling收集），通过命令行来查看、处理收集到的数据。

查看profile数据

go tool pprof -text -nodecount=10  cpu.log
go tool pprof list  cpu.log

-nodecount 函数会为我们列出cpu时间消耗前十的函数。也可以进入cmd模式，进行更详细的剖析使用list命令参数。

调优实战

这里展示的是一个零知识证明的生成和校验的算法的性能分析过程，按照上节所说我们先收集profile，使用的方法是调用pprof模块的方法：

func BenchmarkTransferProofValidator(b *testing.B) {
    once.Do(testInit)
    fd, _ := os.OpenFile("cpu_validator.log", os.O_CREATE| os.O_RDWR, os.ModeSetuid)


    inputs, T := generateInputs(100, big.NewInt(1000000))
    proof, output, B, err := TransferProofGenerator(inputs, T, pairsA, pairsB,
        zkpKeyPairsA, zkpKeyPairsB, zkpRange)

    if err != nil{
      b.Fatalf("TestTransferProofValidator: %s", err.Error())
    }   

    if err := pprof.StartCPUProfile(fd); err != nil{
      b.Fatalf("TestTransferProofValidator %s", err.Error())
    }

    defer func() {
      pprof.StopCPUProfile()
      fd.Close()
    }()

    for i := 0; i < b.N; i++ {
      err = TransferProofValidator(inputs, output, B, pairsA, pairsB,
          zkpKeyPairsA, zkpKeyPairsB, proof)        
      if err != nil{
          b.Fatalf("TestTransferProofValidator: %s", err.Error())
      }
    }
}

执行如下命令，并得到执行结果：

$go test -c -bench=BenchmarkTransferProofValidator -run=None

 D:\gopath\src\github.com\utxo\balance>go test -run=None -bench=Va goos: windows goarch: amd64 pkg: github.com/utxo/balance
 BenchmarkTransferProofValidator-4              3         397296900 ns/op
 PASS
 ok      github.com/utxo/balance 3.237s

TransferProofValidator运行了三次，gomaxprocs = 4，这些都是go tool自动分配，暂时不用深究。可以看到我们的函数运行十分的缓慢，执行一次需要 0.4s 左右，它将作为一个高频调用的模块远远达不到使用的需求。那么，我们就要找到最慢的部分在哪里，找到之后进而去优化掉它。

$go tool pprof list  cpu_validator.log

 D:\gopath\src\github.com\utxo\balance> go tool pprof list  cpu_validator.log
 list: open list: The system cannot find the file specified.
 Fetched 1 source profiles out of 2
 Main binary filename not available.
 Type: cpu
 Time: Dec 20, 2018 at 11:15pm (CST)
 Duration: 625.81ms, Total samples = 420ms (67.11%)
 Entering interactive mode (type "help" for commands, "o" for options)
 (pprof) top
 Showing nodes accounting for 420ms, 100% of 420ms total
 Showing top 10 nodes out of 25
     flat  flat%   sum%        cum   cum%
     240ms 57.14% 57.14%      240ms 57.14%  math/big.addMulVVW
     110ms 26.19% 83.33%      380ms 90.48%  math/big.nat.montgomery
     30ms  7.14% 90.48%       30ms  7.14%  runtime.memmove
     10ms  2.38% 92.86%       10ms  2.38%  math/big.(*Int).QuoRem
     10ms  2.38% 95.24%       10ms  2.38%  math/big.nat.add
     10ms  2.38% 97.62%       10ms  2.38%  runtime.lock
     10ms  2.38%   100%       20ms  4.76%  runtime.newstack
         0     0%   100%      400ms 95.24%  github.com/utxo/balance.BenchmarkTransferProofValidator
         0     0%   100%      400ms 95.24%  github.com/utxo/balance.TransferProofValidator
         0     0%   100%      400ms 95.24%  github.com/utxo/crypto/base.Exp

每一行表示一个函数的信息。前两列表示函数在 CPU 上运行的时间以及百分比；第三列是当前所有函数累加使用 CPU 的比例；第四列和第五列代表这个函数以及子函数运行所占用的时间和比例（也被称为累加值 cumulative），应该大于等于前两列的值；最后一列就是函数的名字。如果应用程序有性能问题，上面这些信息应该能告诉我们时间都花费在哪些函数的执行上了。

所以，按照输出信息看，大部分的时间都花费在了Exp的执行过程中，可以通过list命令来仔细分析一下Exp函数中的哪一部分比较耗时：

 Total: 420ms
 ROUTINE ======================== github.com/utxo/crypto/base.Exp in D:\gopath\src\github.com\utxo\crypto\base\base.go         0      400ms (flat, cum) 95.24% of Total
         .          .     91:           return E
         .          .     92:   }
         .          .     93:
         .          .     94:   if x.Sign() < 1{
         .          .     95:           x = (&big.Int{}).Neg(x)
         .       30ms     96:           E = (&big.Int{}).Exp(g, x, n)
         .          .     97:           if 1 != Gcd(E, n).Int64(){
         .          .     98:                   log.Printf("h_r is not prime with n")
         .          .     99:                   return nil
         .          .    100:           }
         .          .    101:
         .       20ms    102:           E = (&big.Int{}).ModInverse(E, n)
         .          .    103:           E = (&big.Int{}).Mod(E, n)
         .          .    104:
         .          .    105:   }else {
         .      350ms    106:           E = (&big.Int{}).Exp(g, x, n)
         .          .    107:   }
         .          .    108:
         .          .    109:   return E
         .          .    110:}
         .          .    111:

如上所示，base.Exp 总计占用了 95.24% 的cpu时间，而整个函数内最耗时的就是 big.Exp() 函数。这就给作者制造了个难题，因为在本模块中涉及到大量的指数运算，而且输入数据都较大（平均 512bit）。定位到了问题的所在，接下来我们要想办法进行调优。本模块目前是纯粹的算法模块，可以考虑进行并发改造，来发挥go协程的特点，尤其是在调用base.Exp的部分中。但作者将尝试并发改造后发现改造的结果并不理想，因为使用channel进行同步导致阻塞，‘抵消’了多协程带来的性能提升。另外一个思路就是尽量避免big.Exp指数运算，即将指数型运算转换为其他时间复杂度较低的运算，多协程是利用go本身的特性和cpu的多核运算。而这一种方法则纯粹是从数学角度上进行优化，并不具有普适性。

这里，只是带大家感受下这个调优的过程，具体问到具体的场景下要具体分析。

一、常用的测试命令

二、性能剖析流程

收集程序的profile

查看profile数据

调优实战

凯冰科技 · 代码改变世界，技术改变生活