Problem
We (the Kubernetes api-machinery folks) found a performance issue in protobuf serialization by accident: marshaling appears to be quadratic in the depth of the object, apparently because Size() is called repeatedly on the same sub-objects. We only discovered it because a new recursive structure aggregated the cumulative time under a single function signature while profiling, which quickly raised a red flag. We believe the same performance hit affects serialization of all objects, but it is normally spread across many different types, which hides it when profiling.
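To make the mechanism concrete, here is a minimal, self-contained Go sketch (a hand-written stand-in, not the actual generated code) of the pattern we believe is responsible: every level of the marshaling code needs the encoded size of its child in order to write the length prefix, and obtains it by calling Size(), which walks the entire subtree again.

package main

import "fmt"

// node is a hypothetical stand-in for a generated message with one nested
// message field; it exists only to illustrate the call pattern.
type node struct {
    child     *node
    sizeCalls *int // shared counter of Size() invocations, for demonstration
}

// Size walks the whole subtree below n, like generated Size() methods do.
func (n *node) Size() int {
    *n.sizeCalls++
    if n.child == nil {
        return 1 // pretend a leaf encodes to one byte
    }
    return 2 + n.child.Size() // tag byte + length byte + child payload, roughly
}

// Marshal needs the child's size at every level to write its length prefix,
// so a chain of depth d ends up doing ~d*(d+1)/2 Size() walks in total.
func (n *node) Marshal(buf []byte) []byte {
    if n.child == nil {
        return append(buf, 0x01)
    }
    childSize := n.child.Size()              // re-walks the entire subtree below this level
    buf = append(buf, 0x0a, byte(childSize)) // depths used here keep the length in one byte
    return n.child.Marshal(buf)
}

func main() {
    for _, depth := range []int{10, 20, 40} {
        calls := 0
        root := &node{sizeCalls: &calls}
        cur := root
        for i := 0; i < depth; i++ {
            cur.child = &node{sizeCalls: &calls}
            cur = cur.child
        }
        root.Marshal(nil)
        fmt.Printf("depth %3d -> %4d Size() calls\n", depth, calls)
    }
}

Doubling the depth roughly quadruples the number of Size() walks, which is the same shape as the benchmark numbers below.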
Reproduce
Here are some steps to reproduce, a very simple proto file:
syntax = "proto2";
message test {
  map<string, test> tests = 1;
}
Generate marshaling code:
protoc --gofast_out=. *.proto
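For reference, the generated Go type looks roughly like the following (simplified here: struct tags and XXX_ bookkeeping fields are omitted, and the method list is paraphrased from memory rather than copied from the generated file):

// Simplified sketch of the type emitted for the message above; the shape
// matches how the benchmarks below use it.
type Test struct {
    Tests map[string]*Test
}

// The gofast plugin also generates, among others:
//
//    func (m *Test) Size() int                          // walks the message to compute its encoded size
//    func (m *Test) Marshal() ([]byte, error)           // used by the benchmarks below
//    func (m *Test) MarshalTo(data []byte) (int, error) // writes into a pre-sized buffer
//
// Marshal allocates a buffer of Size() bytes and calls MarshalTo, which in
// turn calls Size() again on nested messages to write their length prefixes.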
Some benchmark tests:
package test

import (
    "testing"
)

var result []byte

// generateWide builds a single Test whose map has `width` entries, each with a nil value.
func generateWide(width int) *Test {
    test := Test{
        Tests: map[string]*Test{},
    }
    for i := 0; i < width; i++ {
        test.Tests[string(rune('a'+i))] = nil
    }
    return &test
}

// generateDeep builds a chain of Tests nested `depth` levels deep, one map entry per level.
func generateDeep(depth int) *Test {
    test := Test{
        Tests: nil,
    }
    current := &test
    for i := 0; i < depth; i++ {
        current.Tests = map[string]*Test{
            string(rune('a' + i)): &Test{
                Tests: nil,
            },
        }
        current = current.Tests[string(rune('a'+i))]
    }
    return &test
}
func BenchmarkWideProtobuf1(b *testing.B) {
    test := generateWide(1)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkWideProtobuf2(b *testing.B) {
    test := generateWide(2)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkWideProtobuf5(b *testing.B) {
    test := generateWide(5)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkWideProtobuf10(b *testing.B) {
    test := generateWide(10)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkWideProtobuf20(b *testing.B) {
    test := generateWide(20)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkWideProtobuf50(b *testing.B) {
    test := generateWide(50)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkWideProtobuf100(b *testing.B) {
    test := generateWide(100)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf1(b *testing.B) {
    test := generateDeep(1)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf2(b *testing.B) {
    test := generateDeep(2)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf5(b *testing.B) {
    test := generateDeep(5)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf10(b *testing.B) {
    test := generateDeep(10)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf20(b *testing.B) {
    test := generateDeep(20)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf50(b *testing.B) {
    test := generateDeep(50)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}

func BenchmarkDeepProtobuf100(b *testing.B) {
    test := generateDeep(100)
    var data []byte
    for n := 0; n < b.N; n++ {
        data, _ = test.Marshal()
    }
    result = data
}
Running the benchmarks produces the following results:
> go test -bench=.
goos: linux
goarch: amd64
BenchmarkWideProtobuf1-12 10000000 144 ns/op
BenchmarkWideProtobuf2-12 10000000 207 ns/op
BenchmarkWideProtobuf5-12 3000000 406 ns/op
BenchmarkWideProtobuf10-12 2000000 762 ns/op
BenchmarkWideProtobuf20-12 1000000 1408 ns/op
BenchmarkWideProtobuf50-12 500000 3569 ns/op
BenchmarkWideProtobuf100-12 200000 7289 ns/op
BenchmarkDeepProtobuf1-12 10000000 175 ns/op
BenchmarkDeepProtobuf2-12 3000000 446 ns/op
BenchmarkDeepProtobuf5-12 500000 2012 ns/op
BenchmarkDeepProtobuf10-12 200000 6477 ns/op
BenchmarkDeepProtobuf20-12 50000 23164 ns/op
BenchmarkDeepProtobuf50-12 10000 144829 ns/op
BenchmarkDeepProtobuf100-12 2000 592502 ns/op
PASS
ok _/usr/local/google/home/apelisse/code/proto-test 22.916s
The quadratic complexity is very obvious here: the Wide benchmarks grow roughly linearly with the number of map entries, while the Deep benchmarks roughly quadruple every time the depth doubles (for example, ~145µs at depth 50 versus ~593µs at depth 100).
Solutions
Since sub-message sizes in protobuf are encoded as varints, a length prefix cannot simply be reserved and back-filled, and to avoid copying objects, memoization of sizes is probably the best solution I could think of. I did notice that protoc generates an XXX_sizecache field that could be used for this, though ideally we would want per-serialization memoization rather than per-object state. Also, we generate our marshaling code from Go types rather than from .proto files (AFAICT), which means we can't easily (nor do we want to) embed extra fields in our own types.
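As a rough sketch of what per-serialization memoization could look like, under simplified assumptions (a hand-rolled stand-in type and encoding, not generated code): compute every node's size once, bottom-up, into a cache that lives only for the duration of a single Marshal call, so no extra field has to be embedded in the message types themselves.

package main

import "fmt"

// node is a hypothetical stand-in for a message with one nested field.
type node struct {
    child *node
}

// sizeMemo computes each node's encoded size exactly once per serialization
// and stores it in a cache keyed by node pointer.
func sizeMemo(n *node, cache map[*node]int) int {
    if s, ok := cache[n]; ok {
        return s
    }
    s := 1 // pretend a leaf encodes to one byte
    if n.child != nil {
        s = 2 + sizeMemo(n.child, cache) // tag + length prefix + child payload, roughly
    }
    cache[n] = s
    return s
}

// marshalMemo looks sizes up in the cache instead of re-walking the subtree
// at every level, so the whole serialization is linear in the depth.
func marshalMemo(n *node, cache map[*node]int, buf []byte) []byte {
    if n.child == nil {
        return append(buf, 0x01)
    }
    buf = append(buf, 0x0a, byte(cache[n.child])) // depth kept small enough for a one-byte length
    return marshalMemo(n.child, cache, buf)
}

func main() {
    // Build a chain 50 levels deep and marshal it with a per-call size cache.
    root := &node{}
    cur := root
    for i := 0; i < 50; i++ {
        cur.child = &node{}
        cur = cur.child
    }
    cache := make(map[*node]int, 64)
    sizeMemo(root, cache)
    out := marshalMemo(root, cache, nil)
    fmt.Printf("marshaled %d bytes, %d sizes computed once each\n", len(out), len(cache))
}

The cache here is keyed by pointer and discarded after the call, which matches the "per-serialization rather than per-object" preference above; the point is simply that each subtree's size is computed once per serialization rather than once per ancestor.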
The Kubernetes side of this issue is tracked in kubernetes/kubernetes#76219.