grpc-go source code trace - how a gRPC client establishes a connection with a server

Preface

My original goal was to study gRPC's retry mechanism, but before getting into retry it is necessary to explain the whole process by which a gRPC client establishes a connection with a server. So this post first walks through the source code to outline the connection flow that runs after calling grpc.Dial, including the mechanism gRPC uses for load balancing.

Packages

grpc-go v1.19.1

Behind grpc.Dial

The basic way for a client to connect to a server looks like this:

```go
conn, err := grpc.Dial(serverIpAddress, grpc.WithInsecure())

func Dial(target string, opts ...DialOption) (*ClientConn, error) {
	return DialContext(context.Background(), target, opts...)
}
```

First, the client needs a Name Resolver to resolve the target string passed to Dial into the server's actual IP addresses and port, so that connections can be established afterwards. For example, with the DNS name resolver you can address the server by domain name, e.g. `conn, err := grpc.Dial("dns:///your.target.name:8888")`, and send RPCs that way.

For more on name resolution, see the official documentation.

```go
// Notice: parts of the source code are omitted.
func DialContext(ctx context.Context, target string, opts ...DialOption) (conn *ClientConn, err error) {
	if cc.dopts.resolverBuilder == nil {
		cc.parsedTarget = parseTarget(cc.target)
		grpclog.Infof("parsed scheme: %q", cc.parsedTarget.Scheme)
		cc.dopts.resolverBuilder = resolver.Get(cc.parsedTarget.Scheme)
		if cc.dopts.resolverBuilder == nil {
			// If resolver builder is still nil, the parsed target's scheme is
			// not registered. Fallback to default resolver and set Endpoint to
			// the original target.
			grpclog.Infof("scheme %q not registered, fallback to default scheme", cc.parsedTarget.Scheme)
			cc.parsedTarget = resolver.Target{
				Scheme:   resolver.GetDefaultScheme(),
				Endpoint: target,
			}
			// By default this yields the passthrough resolver builder (passthrough.go).
			cc.dopts.resolverBuilder = resolver.Get(cc.parsedTarget.Scheme)
		}
	}

	rWrapper, err := newCCResolverWrapper(cc)
	if err != nil {
		return nil, fmt.Errorf("failed to build resolver: %v", err)
	}
}
```

Resolver is an interface, so users can register their own implementations as needed, e.g. etcd-io/etcd. A specific resolver can be selected via the URI scheme (e.g. etcd://); when none is specified, grpc-go falls back to the default passthrough resolver. passthrough simply returns the target unchanged, so it is only suitable for simple setups or testing. Also note that the gRPC documentation says the default is DNS, yet grpc-go actually defaults to passthrough.
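Because the default is passthrough rather than DNS, you can opt into DNS either per target with the dns:/// scheme (as above) or globally. A minimal sketch, assuming the resolver.SetDefaultScheme API available in this version of the resolver package:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/resolver"
)

func main() {
	// Make "dns" the default scheme so that a bare target such as
	// "your.target.name:8888" goes through the DNS resolver instead of
	// being passed through verbatim. Must be called before Dial.
	resolver.SetDefaultScheme("dns")

	conn, err := grpc.Dial("your.target.name:8888", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```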

For a third-party resolver implementation, see the etcd clientv3 resolver.
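To get a feel for the Builder/Resolver interfaces involved, here is a minimal hand-rolled resolver sketch. staticBuilder, staticResolver, the "static" scheme, and the hard-coded addresses are all hypothetical, for illustration only:

```go
package main

import "google.golang.org/grpc/resolver"

// staticBuilder builds a resolver that always reports a fixed address list.
type staticBuilder struct {
	addrs []string
}

func (b *staticBuilder) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOption) (resolver.Resolver, error) {
	r := &staticResolver{cc: cc, addrs: b.addrs}
	r.resolve()
	return r, nil
}

// Scheme makes this builder answer for "static:///..." targets.
func (b *staticBuilder) Scheme() string { return "static" }

type staticResolver struct {
	cc    resolver.ClientConn
	addrs []string
}

func (r *staticResolver) resolve() {
	resolved := make([]resolver.Address, len(r.addrs))
	for i, a := range r.addrs {
		resolved[i] = resolver.Address{Addr: a}
	}
	// Push the address list to the ClientConn, just like passthrough does.
	r.cc.UpdateState(resolver.State{Addresses: resolved})
}

// ResolveNow is a hint to re-resolve; the list is static, so re-push it.
func (r *staticResolver) ResolveNow(resolver.ResolveNowOption) { r.resolve() }

func (r *staticResolver) Close() {}

func init() {
	// After registration, grpc.Dial("static:///anything") uses this resolver.
	resolver.Register(&staticBuilder{addrs: []string{"10.0.0.1:8888", "10.0.0.2:8888"}})
}
```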

```go
// Notice: parts of the source code are omitted.
func newCCResolverWrapper(cc *ClientConn) (*ccResolverWrapper, error) {
	rb := cc.dopts.resolverBuilder
	ccr := &ccResolverWrapper{
		cc:     cc,
		addrCh: make(chan []resolver.Address, 1),
		scCh:   make(chan string, 1),
	}
	// Build a resolver from the resolver builder.
	var err error
	ccr.resolver, err = rb.Build(cc.parsedTarget, ccr, resolver.BuildOption{DisableServiceConfig: cc.dopts.disableServiceConfig})
	if err != nil {
		return nil, err
	}
	return ccr, nil
}
```

The rest of the flow is illustrated below with the default passthrough resolver.

```go
// The default passthrough Builder.
func (*passthroughBuilder) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOption) (resolver.Resolver, error) {
	r := &passthroughResolver{
		target: target,
		cc:     cc,
	}
	r.start()
	return r, nil
}

func (r *passthroughResolver) start() {
	// Simply hand back the target the user passed in.
	r.cc.UpdateState(resolver.State{Addresses: []resolver.Address{{Addr: r.target.Endpoint}}})
}

func (ccr *ccResolverWrapper) UpdateState(s resolver.State) {
	// Notify the ClientConn through the wrapper to update its state.
	ccr.cc.updateResolverState(s)
}
```

Once the Name Resolver has resolved the target, the latest addresses are delivered to the Balancer through a watcher created by the balancer wrapper.

![gRPC-Dial-1.png]({{ site.url }}/assets/images/gRPC-Dial-1.png)

The Balancer is the key component of gRPC's load-balancing implementation. Its main responsibilities are handling connection and address state changes (e.g. re-establishing connections when the address list is updated) and, later in the transport phase, deciding which server connection an RPC should use (the balancer policy). gRPC load balancing follows the External Load Balancing Service design, so users can plug in alternative Balancer implementations. As the balancer package's documentation puts it:

> Balancer takes input from gRPC, manages SubConns, and collects and aggregates the connectivity states. It also generates and updates the Picker used by gRPC to pick SubConns for RPCs.
>
> SubConn represents a gRPC sub connection. Each sub connection contains a list of addresses. gRPC will try to connect to them (in sequence), and stop trying the remainder once one connection is successful.
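Unless overridden, the default balancer is pick_first (the updateResolverState excerpt below shows it being chosen). A minimal sketch of opting into the built-in round_robin balancer instead, using the grpc.WithBalancerName DialOption from this version:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/balancer/roundrobin"
)

func main() {
	// Select the built-in round_robin balancer instead of the default
	// pick_first. The resolver must return multiple addresses (e.g. via
	// a dns:/// target) for this to make a difference.
	conn, err := grpc.Dial(
		"dns:///your.target.name:8888",
		grpc.WithInsecure(),
		grpc.WithBalancerName(roundrobin.Name),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```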

```go
// Notice: parts of the source code are omitted.
func (cc *ClientConn) updateResolverState(s resolver.State) error {
	if cc.dopts.balancerBuilder == nil {
		if isGRPCLB {
			newBalancerName = grpclbName
		} else if cc.sc != nil && cc.sc.LB != nil {
			newBalancerName = *cc.sc.LB
		} else {
			// Default to the pick-first balancer (pick_first.go).
			newBalancerName = PickFirstBalancerName
		}
		cc.switchBalancer(newBalancerName)
	}
	cc.balancerWrapper.updateResolverState(s)
}

// Use resolverUpdateCh to pass the resolved addresses to the Balancer.
func (ccb *ccBalancerWrapper) updateResolverState(s resolver.State) {
	ccb.resolverUpdateCh <- &s
}

func (cc *ClientConn) switchBalancer(name string) {
	cc.balancerWrapper = newCCBalancerWrapper(cc, builder, cc.balancerBuildOpts)
}

func newCCBalancerWrapper(cc *ClientConn, b balancer.Builder, bopts balancer.BuildOptions) *ccBalancerWrapper {
	ccb := &ccBalancerWrapper{
		cc:               cc,
		stateChangeQueue: newSCStateUpdateBuffer(),
		resolverUpdateCh: make(chan *resolver.State, 1),
		done:             make(chan struct{}),
		subConns:         make(map[*acBalancerWrapper]struct{}),
	}
	// Start a watcher goroutine for name-resolution events.
	go ccb.watcher()
	ccb.balancer = b.Build(ccb, bopts)
	return ccb
}
```

The following excerpt shows what happens after the watcher receives a resolver update:

```go
// watcher calls balancer functions sequentially, so the balancer can be
// implemented lock-free.
func (ccb *ccBalancerWrapper) watcher() {
	for {
		select {
		case s := <-ccb.resolverUpdateCh:
			select {
			case <-ccb.done:
				ccb.balancer.Close()
				return
			default:
			}
			if ub, ok := ccb.balancer.(balancer.V2Balancer); ok {
				ub.UpdateResolverState(*s)
			} else {
				ccb.balancer.HandleResolvedAddrs(s.Addresses, nil)
			}
		}
	}
}
```

Here we take the simple pick-first balancer as an example to see what the Balancer actually does when handling resolved addresses.

```go
type pickfirstBalancer struct {
	cc balancer.ClientConn
	sc balancer.SubConn
}

func (b *pickfirstBalancer) HandleResolvedAddrs(addrs []resolver.Address, err error) {
	if err != nil {
		grpclog.Infof("pickfirstBalancer: HandleResolvedAddrs called with error %v", err)
		return
	}
	if b.sc == nil {
		b.sc, err = b.cc.NewSubConn(addrs, balancer.NewSubConnOptions{})
		if err != nil {
			grpclog.Errorf("pickfirstBalancer: failed to NewSubConn: %v", err)
			return
		}
		b.cc.UpdateBalancerState(connectivity.Idle, &picker{sc: b.sc})
		b.sc.Connect() // connect to the server (b.sc is a wrapper from balancer_conn_wrappers.go)
	} else {
		b.sc.UpdateAddresses(addrs)
		b.sc.Connect()
	}
}
```

As shown above, if there is no SubConn yet, the balancer asks the ClientConn (via balancer_conn_wrappers) to create one, registers a picker for it via UpdateBalancerState, and then triggers Connect.
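That picker is what gRPC later calls for every RPC to choose a SubConn. For pick_first it is trivial, along these lines (paraphrased from pick_first.go in this version):

```go
import (
	"context"

	"google.golang.org/grpc/balancer"
)

// picker always returns the single SubConn owned by the pick-first balancer.
type picker struct {
	err error
	sc  balancer.SubConn
}

// Pick is called by gRPC for each RPC to choose the SubConn to send it on.
func (p *picker) Pick(ctx context.Context, opts balancer.PickOptions) (balancer.SubConn, func(balancer.DoneInfo), error) {
	if p.err != nil {
		return nil, nil, p.err
	}
	// Every RPC on this ClientConn goes over the same SubConn.
	return p.sc, nil, nil
}
```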

```go
// Notice: parts of the source code are omitted.
func (ac *addrConn) connect() error {
	ac.mu.Lock()
	ac.updateConnectivityState(connectivity.Connecting)
	ac.mu.Unlock()

	// Start a goroutine connecting to the server asynchronously.
	go ac.resetTransport()
	return nil
}
```

At this point a new goroutine is spawned to actually establish the connection with the server.

![gRPC-Dial-2.png]({{ site.url }}/assets/images/gRPC-Dial-2.png)

resetTransport is an endless loop: if the server fails and the connection drops, the client keeps reconnecting until it either succeeds or the state becomes connectivity.Shutdown.

```go
// Notice: parts of the source code are omitted.
func (ac *addrConn) resetTransport() {
	// Reconnect forever.
	for i := 0; ; i++ {
		if i > 0 {
			ac.cc.resolveNow(resolver.ResolveNowOption{})
		}

		ac.mu.Lock()
		if ac.state == connectivity.Shutdown {
			ac.mu.Unlock()
			return
		}

		newTr, addr, reconnect, err := ac.tryAllAddrs(addrs, connectDeadline)
		if err != nil {
			ac.mu.Lock()
			if ac.state == connectivity.Shutdown {
				ac.mu.Unlock()
				return
			}
			ac.updateConnectivityState(connectivity.TransientFailure)

			// Backoff.
			b := ac.resetBackoff
			ac.mu.Unlock()

			// Wait for backoffFor, then reconnect.
			timer := time.NewTimer(backoffFor)
			select {
			case <-timer.C:
				ac.mu.Lock()
				ac.backoffIdx++
				ac.mu.Unlock()
			case <-b:
				timer.Stop()
			case <-ac.ctx.Done():
				timer.Stop()
				return
			}
			continue
		}

		ac.mu.Lock()
		if ac.state == connectivity.Shutdown {
			newTr.Close()
			ac.mu.Unlock()
			return
		}
		ac.curAddr = addr
		ac.transport = newTr
		ac.backoffIdx = 0

		// A connection in the Ready state can be picked.
		if !healthcheckManagingState {
			ac.updateConnectivityState(connectivity.Ready)
		}
		ac.mu.Unlock()

		// Block until the created transport is down. When that happens,
		// we restart from the top of the addr list.
		<-reconnect.Done()
		hcancel()
	}
}
```
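The backoffFor delay above grows between attempts according to gRPC's connection backoff strategy, and its cap can be tuned at Dial time. A minimal sketch, assuming the grpc.WithBackoffMaxDelay DialOption available in v1.19:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Cap the reconnect backoff at 5s so that a recovering server is
	// picked up reasonably quickly after a long outage.
	conn, err := grpc.Dial(
		"your.target.name:8888",
		grpc.WithInsecure(),
		grpc.WithBackoffMaxDelay(5*time.Second),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```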

If you additionally pass grpc.WithBlock to the initial grpc.Dial, the call will not return until the connection reaches connectivity.Ready. By default Dial is non-blocking, so the client can get on with other work while the connection is being established.
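For example, a blocking dial bounded by a context timeout, so that an unreachable server fails fast instead of blocking forever; a minimal sketch:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Block until the connection is Ready, but give up after 3 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "your.target.name:8888",
		grpc.WithInsecure(),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()
}
```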

Finally, you can run a small experiment and watch gRPC's logs to get a better feel for the flow. Just set the environment variables GRPC_GO_LOG_VERBOSITY_LEVEL=99 and GRPC_GO_LOG_SEVERITY_LEVEL=info,

and when the client cannot reach the server you will see it repeatedly attempting to reconnect.

![gRPC-Dial-3.png]({{ site.url }}/assets/images/gRPC-Dial-3.png)

References

  1. https://github.com/grpc/grpc/blob/master/doc/load-balancing.md