
Performance issue with "deeply" nested structures #559

@apelisse

Description

Problem

We (the kubernetes api-machinery folks) found a performance issue in protobuf serialization by accident. Serialization appears to be quadratic in the depth of the object, apparently because "Size()" is called repeatedly on the same nested messages. We were lucky to discover the problem thanks to a new recursive structure: while profiling, the cumulative time aggregated under a single function signature, quickly raising a red flag. We believe, though, that the same performance hit affects serialization of all objects; it just happens to be distributed across many different types, which hides the problem when profiling.
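
To make the mechanism concrete, here is a minimal, self-contained sketch of the shape of the generated marshaling code (Node, sovDemo, and encodeVarintDemo are our illustrative names, not actual generator output). The top-level caller first invokes Size() on the whole tree to allocate the buffer, and MarshalTo then calls Size() again on every child to write its length prefix, so a chain of depth d computes sizes roughly d²/2 times:

package main

import "fmt"

// Node mimics the shape of a generated message with one nested-message
// field; names here are illustrative, not protoc-gen-gofast output.
type Node struct {
	Child *Node
}

var sizeCalls int // counts Size() invocations to expose the quadratic blow-up

// Size re-walks the entire subtree on every call, like the generated code.
func (m *Node) Size() (n int) {
	sizeCalls++
	if m.Child != nil {
		l := m.Child.Size()
		n += 1 + sovDemo(uint64(l)) + l // tag + length varint + payload
	}
	return n
}

// MarshalTo writes the message, calling Size() again on each child to
// emit its length prefix -- this is the repeated work.
func (m *Node) MarshalTo(dAtA []byte) (int, error) {
	i := 0
	if m.Child != nil {
		dAtA[i] = 0xa // field 1, wire type 2 (length-delimited)
		i++
		i = encodeVarintDemo(dAtA, i, uint64(m.Child.Size()))
		n, err := m.Child.MarshalTo(dAtA[i:])
		if err != nil {
			return 0, err
		}
		i += n
	}
	return i, nil
}

// sovDemo returns the number of bytes x occupies as a varint.
func sovDemo(x uint64) (n int) {
	for {
		n++
		x >>= 7
		if x == 0 {
			return n
		}
	}
}

// encodeVarintDemo writes x as a varint at dAtA[offset:].
func encodeVarintDemo(dAtA []byte, offset int, x uint64) int {
	for x >= 1<<7 {
		dAtA[offset] = uint8(x&0x7f | 0x80)
		x >>= 7
		offset++
	}
	dAtA[offset] = uint8(x)
	return offset + 1
}

func main() {
	root := &Node{}
	cur := root
	for i := 0; i < 100; i++ { // build a chain 100 levels deep
		cur.Child = &Node{}
		cur = cur.Child
	}
	buf := make([]byte, root.Size())
	if _, err := root.MarshalTo(buf); err != nil {
		panic(err)
	}
	fmt.Println("Size() calls for one marshal:", sizeCalls) // ~5000, not ~100
}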

Reproduce

Here are steps to reproduce. First, a very simple proto file:

syntax = "proto2";

message test {
  map<string, test> tests = 1;
}

Generate marshaling code:

protoc --gofast_out=. *.proto

Some benchmark tests:

package test

import (
	"testing"
)

var result []byte // sink to keep the compiler from optimizing Marshal away

// generateWide builds a single map with `width` keys and nil values.
func generateWide(width int) *Test {
	test := Test{
		Tests: map[string]*Test{},
	}
	for i := 0; i < width; i++ {
		test.Tests[string(rune('a'+i))] = nil
	}
	return &test
}

// generateDeep builds a chain of single-entry maps, `depth` levels deep.
func generateDeep(depth int) *Test {
	test := Test{
		Tests: nil,
	}
	current := &test
	for i := 0; i < depth; i++ {
		key := string(rune('a' + i))
		current.Tests = map[string]*Test{
			key: {Tests: nil},
		}
		current = current.Tests[key]
	}
	return &test
}

func BenchmarkWideProtobuf1(b *testing.B) {
	test := generateWide(1)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf2(b *testing.B) {
	test := generateWide(2)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf5(b *testing.B) {
	test := generateWide(5)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf10(b *testing.B) {
	test := generateWide(10)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf20(b *testing.B) {
	test := generateWide(20)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf50(b *testing.B) {
	test := generateWide(50)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkWideProtobuf100(b *testing.B) {
	test := generateWide(100)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf1(b *testing.B) {
	test := generateDeep(1)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf2(b *testing.B) {
	test := generateDeep(2)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf5(b *testing.B) {
	test := generateDeep(5)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf10(b *testing.B) {
	test := generateDeep(10)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf20(b *testing.B) {
	test := generateDeep(20)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf50(b *testing.B) {
	test := generateDeep(50)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

func BenchmarkDeepProtobuf100(b *testing.B) {
	test := generateDeep(100)
	var data []byte
	for n := 0; n < b.N; n++ {
		data, _ = test.Marshal()
	}
	result = data
}

Running the benchmarks produces the following results:

> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12      	10000000	       144 ns/op
BenchmarkWideProtobuf2-12      	10000000	       207 ns/op
BenchmarkWideProtobuf5-12      	 3000000	       406 ns/op
BenchmarkWideProtobuf10-12     	 2000000	       762 ns/op
BenchmarkWideProtobuf20-12     	 1000000	      1408 ns/op
BenchmarkWideProtobuf50-12     	  500000	      3569 ns/op
BenchmarkWideProtobuf100-12    	  200000	      7289 ns/op
BenchmarkDeepProtobuf1-12      	10000000	       175 ns/op
BenchmarkDeepProtobuf2-12      	 3000000	       446 ns/op
BenchmarkDeepProtobuf5-12      	  500000	      2012 ns/op
BenchmarkDeepProtobuf10-12     	  200000	      6477 ns/op
BenchmarkDeepProtobuf20-12     	   50000	     23164 ns/op
BenchmarkDeepProtobuf50-12     	   10000	    144829 ns/op
BenchmarkDeepProtobuf100-12    	    2000	    592502 ns/op
PASS
ok  	_/usr/local/google/home/apelisse/code/proto-test	22.916s

The quadratic complexity is obvious here: the wide benchmarks scale roughly linearly with width (doubling from 50 to 100 keys doubles the time, 3569 → 7289 ns/op), while the deep benchmarks scale quadratically with depth (doubling from 50 to 100 levels quadruples it, 144829 → 592502 ns/op).

Solutions

Since sub-field sizes in protobuf are varints (so each nested message's size must be known before its bytes are written), and to avoid copying objects, memoization is probably the best solution I could think of. I did notice that protoc generates a field named XXX_sizecache that could definitely be used for this, though ideally we'd want per-serialization memoization rather than per-object. Also, we generate the marshaling code from Go types rather than from .proto files (AFAICT), which means we can't easily (nor do we want to) embed extra fields in our own types.
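
For illustration only, extending the toy Node type from the sketch above (the helper names are hypothetical, not gogo's actual API): per-serialization memoization could compute each node's size once into a scratch map that lives only for the duration of a single marshal call, so MarshalTo reads cached values instead of re-walking subtrees:

// sizeCache memoizes encoded sizes for the duration of one marshal call.
type sizeCache map[*Node]int

// sizeMemo is Size() with per-serialization memoization: each node's
// size is computed exactly once.
func (m *Node) sizeMemo(cache sizeCache) int {
	if n, ok := cache[m]; ok {
		return n
	}
	n := 0
	if m.Child != nil {
		l := m.Child.sizeMemo(cache)
		n += 1 + sovDemo(uint64(l)) + l
	}
	cache[m] = n
	return n
}

// marshalMemo allocates the scratch cache, sizes the buffer, and writes.
func (m *Node) marshalMemo() ([]byte, error) {
	cache := sizeCache{}
	dAtA := make([]byte, m.sizeMemo(cache))
	_, err := m.marshalToMemo(dAtA, cache)
	return dAtA, err
}

func (m *Node) marshalToMemo(dAtA []byte, cache sizeCache) (int, error) {
	i := 0
	if m.Child != nil {
		dAtA[i] = 0xa
		i++
		// The length prefix now comes from the cache: O(1), not O(subtree).
		i = encodeVarintDemo(dAtA, i, uint64(m.Child.sizeMemo(cache)))
		n, err := m.Child.marshalToMemo(dAtA[i:], cache)
		if err != nil {
			return 0, err
		}
		i += n
	}
	return i, nil
}

Each node's size is now computed once per serialization, making the deep case linear, and nothing is stored on the objects themselves (unlike XXX_sizecache), so concurrent marshals of a shared object remain safe.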

The Kubernetes side of this issue is tracked in kubernetes/kubernetes#76219.
