
Performance issue with "deeply" nested structures #559

@apelisse

Description

Problem

We (the kubernetes api-machinery folks) found a performance issue in protobuf serialization by accident. Serialization appears to be quadratic in the depth of the object, apparently because "Size()" is called repeatedly on the same nested messages. We were lucky to discover the problem thanks to a new recursive structure: while profiling, the cumulative time aggregated under a single function signature, quickly raising a red flag. We believe, though, that the same performance hit affects serialization of all objects; it just happens to be distributed across many different types, which hides the problem when profiling.
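
To make the mechanism concrete, here is a minimal, self-contained sketch of the shape of the generated marshaling code (Node, sovDemo, and encodeVarintDemo are our illustrative names, not actual generator output). The top-level caller first invokes Size() on the whole tree to allocate the buffer, and MarshalTo then calls Size() again on every child to write its length prefix, so a chain of depth d computes sizes roughly d²/2 times:

package main

import "fmt"

// Node mimics the shape of a generated message with one nested-message
// field; names here are illustrative, not protoc-gen-gofast output.
type Node struct {
	Child *Node
}

var sizeCalls int // counts Size() invocations to expose the quadratic blow-up

// Size re-walks the entire subtree on every call, like the generated code.
func (m *Node) Size() (n int) {
	sizeCalls++
	if m.Child != nil {
		l := m.Child.Size()
		n += 1 + sovDemo(uint64(l)) + l // tag + length varint + payload
	}
	return n
}

// MarshalTo writes the message, calling Size() again on each child to
// emit its length prefix -- this is the repeated work.
func (m *Node) MarshalTo(dAtA []byte) (int, error) {
	i := 0
	if m.Child != nil {
		dAtA[i] = 0xa // field 1, wire type 2 (length-delimited)
		i++
		i = encodeVarintDemo(dAtA, i, uint64(m.Child.Size()))
		n, err := m.Child.MarshalTo(dAtA[i:])
		if err != nil {
			return 0, err
		}
		i += n
	}
	return i, nil
}

// sovDemo returns the number of bytes x occupies as a varint.
func sovDemo(x uint64) (n int) {
	for {
		n++
		x >>= 7
		if x == 0 {
			return n
		}
	}
}

// encodeVarintDemo writes x as a varint at dAtA[offset:].
func encodeVarintDemo(dAtA []byte, offset int, x uint64) int {
	for x >= 1<<7 {
		dAtA[offset] = uint8(x&0x7f | 0x80)
		x >>= 7
		offset++
	}
	dAtA[offset] = uint8(x)
	return offset + 1
}

func main() {
	root := &Node{}
	cur := root
	for i := 0; i < 100; i++ { // build a chain 100 levels deep
		cur.Child = &Node{}
		cur = cur.Child
	}
	buf := make([]byte, root.Size())
	if _, err := root.MarshalTo(buf); err != nil {
		panic(err)
	}
	fmt.Println("Size() calls for one marshal:", sizeCalls) // ~5000, not ~100
}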

Reproduce

Here are steps to reproduce. First, a very simple proto file:

syntax = "proto2";

message test {
  map<string, test> tests = 1;
}

Generate marshaling code:

protoc --gofast_out=. *.proto

Some benchmark tests:

package test

import (
	"testing"
)

var result []byte // sink to keep the compiler from optimizing Marshal away

// generateWide builds a single map with `width` keys and nil values.
func generateWide(width int) *Test {
	test := Test{
		Tests: map[string]*Test{},
	}
	for i := 0; i < width; i++ {
		test.Tests[string(rune('a'+i))] = nil
	}
	return &test
}

// generateDeep builds a chain of single-entry maps, `depth` levels deep.
func generateDeep(depth int) *Test {
	test := Test{
		Tests: nil,
	}
	current := &test
	for i := 0; i < depth; i++ {
		key := string(rune('a' + i))
		current.Tests = map[string]*Test{
			key: {Tests: nil},
		}
		current = current.Tests[key]
	}
	return &test
}

func BenchmarkWideProtobuf1(b *testing.B) {
	test := generateWide(1)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf2(b *testing.B) {
	test := generateWide(2)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf5(b *testing.B) {
	test := generateWide(5)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf10(b *testing.B) {
	test := generateWide(10)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf20(b *testing.B) {
	test := generateWide(20)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf50(b *testing.B) {
	test := generateWide(50)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf100(b *testing.B) {
	test := generateWide(100)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf1(b *testing.B) {
	test := generateDeep(1)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf2(b *testing.B) {
	test := generateDeep(2)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf5(b *testing.B) {
	test := generateDeep(5)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf10(b *testing.B) {
	test := generateDeep(10)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf20(b *testing.B) {
	test := generateDeep(20)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf50(b *testing.B) {
	test := generateDeep(50)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf100(b *testing.B) {
	test := generateDeep(100)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

Running the benchmarks produces the following results:

> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12      	10000000	       144 ns/op
BenchmarkWideProtobuf2-12      	10000000	       207 ns/op
BenchmarkWideProtobuf5-12      	 3000000	       406 ns/op
BenchmarkWideProtobuf10-12     	 2000000	       762 ns/op
BenchmarkWideProtobuf20-12     	 1000000	      1408 ns/op
BenchmarkWideProtobuf50-12     	  500000	      3569 ns/op
BenchmarkWideProtobuf100-12    	  200000	      7289 ns/op
BenchmarkDeepProtobuf1-12      	10000000	       175 ns/op
BenchmarkDeepProtobuf2-12      	 3000000	       446 ns/op
BenchmarkDeepProtobuf5-12      	  500000	      2012 ns/op
BenchmarkDeepProtobuf10-12     	  200000	      6477 ns/op
BenchmarkDeepProtobuf20-12     	   50000	     23164 ns/op
BenchmarkDeepProtobuf50-12     	   10000	    144829 ns/op
BenchmarkDeepProtobuf100-12    	    2000	    592502 ns/op
PASS
ok  	_/usr/local/google/home/apelisse/code/proto-test	22.916s

The quadratic complexity is obvious here: the wide benchmarks scale roughly linearly with width (doubling from 50 to 100 keys doubles the time, 3569 → 7289 ns/op), while the deep benchmarks scale quadratically with depth (doubling from 50 to 100 levels quadruples it, 144829 → 592502 ns/op).

Solutions

Since sub-field sizes in protobuf are varints (so each nested message's size must be known before its bytes are written), and to avoid copying objects, memoization is probably the best solution I could think of. I did notice that protoc generates a field named XXX_sizecache that could definitely be used for this, though ideally we'd want per-serialization memoization rather than per-object. Also, we generate the marshaling code from Go types rather than from .proto files (AFAICT), which means we can't easily (nor do we want to) embed extra fields in our own types.
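
For illustration only, extending the toy Node type from the sketch above (the helper names are hypothetical, not gogo's actual API): per-serialization memoization could compute each node's size once into a scratch map that lives only for the duration of a single marshal call, so MarshalTo reads cached values instead of re-walking subtrees:

// sizeCache memoizes encoded sizes for the duration of one marshal call.
type sizeCache map[*Node]int

// sizeMemo is Size() with per-serialization memoization: each node's
// size is computed exactly once.
func (m *Node) sizeMemo(cache sizeCache) int {
	if n, ok := cache[m]; ok {
		return n
	}
	n := 0
	if m.Child != nil {
		l := m.Child.sizeMemo(cache)
		n += 1 + sovDemo(uint64(l)) + l
	}
	cache[m] = n
	return n
}

// marshalMemo allocates the scratch cache, sizes the buffer, and writes.
func (m *Node) marshalMemo() ([]byte, error) {
	cache := sizeCache{}
	dAtA := make([]byte, m.sizeMemo(cache))
	_, err := m.marshalToMemo(dAtA, cache)
	return dAtA, err
}

func (m *Node) marshalToMemo(dAtA []byte, cache sizeCache) (int, error) {
	i := 0
	if m.Child != nil {
		dAtA[i] = 0xa
		i++
		// The length prefix now comes from the cache: O(1), not O(subtree).
		i = encodeVarintDemo(dAtA, i, uint64(m.Child.sizeMemo(cache)))
		n, err := m.Child.marshalToMemo(dAtA[i:], cache)
		if err != nil {
			return 0, err
		}
		i += n
	}
	return i, nil
}

Each node's size is now computed once per serialization, making the deep case linear, and nothing is stored on the objects themselves (unlike XXX_sizecache), so concurrent marshals of a shared object remain safe.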

The Kubernetes side of this issue is tracked in kubernetes/kubernetes#76219.
