googlecloudpubsub receiver splits logs over 4MB after removal of raw_text encoding in version 132 #42775

@dantheman39

Component(s)

receiver/googlecloudpubsub

What happened?

Description

Since the raw_text encoding was removed in version 132, the googlecloudpubsub receiver splits any log larger than 4MB into multiple log records.

First off, thank you all for your work here!

We're exporting Cloud Run logs via this receiver. GCP regularly emits some logs that are a little larger than 4 megabytes. Up until version 131, the googlecloudpubsub receiver was able to export each of these logs without splitting them, using the raw_text encoding.

In version 132 (see PR #41813), the raw_text encoding was removed, and the guidance is to use the text_encoding extension.

Example config change

Old config:

receivers:
  googlecloudpubsub:
    project: "${env:PROJECT_ID}"
    subscription: "projects/${env:PROJECT_ID}/subscriptions/${env:PUBSUB_SUBSCRIPTION}"
    encoding: raw_text

New config:

extensions:
  text_encoding:
    encoding: utf-8

receivers:
  googlecloudpubsub:
    project: "${env:PROJECT_ID}"
    subscription: "projects/${env:PROJECT_ID}/subscriptions/${env:PUBSUB_SUBSCRIPTION}"
    encoding: text_encoding

service:
  extensions: [text_encoding]

With this change, I'm now observing that logs bigger than 4MB are split up into multiple logs, resulting in malformed JSON and drastically reducing the usefulness of the log.

Steps to Reproduce

I created a Docker Compose config that reproduces this issue using the Pub/Sub emulator, a Python client, and a large JSON payload. I put in some effort to make it easy to run; please let me know if you run into issues: https://github.com/dantheman39/otel-pubsub-debugging
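For anyone who wants to build a similar payload without cloning the repo, here is a minimal sketch of how a JSON document just over 4MB could be generated. It is not the script from the linked repo; `build_large_payload`, the `spec`/`env` field names, and the batch size are hypothetical and only loosely modeled on the log snippets in this issue.

```python
import json

FOUR_MB = 4 * 1024 * 1024  # size threshold reported in this issue


def build_large_payload(min_bytes: int = FOUR_MB + 1024) -> bytes:
    """Return pretty-printed UTF-8 JSON that is at least `min_bytes` long.

    The document shape is hypothetical; it only mimics the long list of
    name/value entries visible in the debug output.
    """
    env = []
    doc = {"protoPayload": {"response": {"spec": {"env": env}}}}
    index = 0
    encoded = json.dumps(doc, indent=2).encode("utf-8")
    while len(encoded) < min_bytes:
        # Grow the list in batches so the document is not re-serialized
        # after every single entry.
        for _ in range(10_000):
            env.append({"name": f"VAR_{index}", "value": f"VAR_{index}"})
            index += 1
        encoded = json.dumps(doc, indent=2).encode("utf-8")
    return encoded


payload = build_large_payload()
print(f"payload size: {len(payload)} bytes")
```

The resulting bytes would then be published as a single Pub/Sub message to the topic behind the subscription, for example with the publisher client from the google-cloud-pubsub package pointed at the emulator.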

Expected Result

Logs that are above 4MB aren't split into multiple messages.

Actual Result

Single logs above 4MB in size are split into multiple messages.
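To illustrate why the split makes the logs unusable downstream, here is a small sketch. It assumes, purely for illustration, that the body is cut at some character offset; where and why the receiver actually cuts is not established in this issue. The helper names `is_valid_json` and `split_at` are hypothetical.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def split_at(text: str, offset: int) -> tuple[str, str]:
    """Cut `text` into two fragments at `offset` (illustrative only)."""
    return text[:offset], text[offset:]


# A small stand-in for the large log body.
doc = json.dumps(
    {"items": [{"name": f"VAR_{i}", "value": f"VAR_{i}"} for i in range(1000)]},
    indent=2,
)
head, tail = split_at(doc, len(doc) // 2)

print(is_valid_json(doc))          # the unsplit body parses
print(is_valid_json(head))         # the first fragment alone does not
print(is_valid_json(tail))         # neither does the second
print(is_valid_json(head + tail))  # only the rejoined body parses again
```

Neither fragment parses on its own, so any JSON parsing later in the pipeline or in the backend fails unless the fragments are rejoined in order.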

Collector version

0.135.0

Environment information

Environment

OS: Seen on Docker (on macOS) and on Ubuntu. I can give more specifics if requested.

OpenTelemetry Collector configuration

This reproduces the issue, see linked repo.

extensions:
  text_encoding:
    encoding: utf-8

receivers:
  googlecloudpubsub:
    endpoint: "pubsub:8085"
    insecure: true
    project: "${env:PROJECT_ID}"
    subscription: "projects/${env:PROJECT_ID}/subscriptions/${env:PUBSUB_SUBSCRIPTION}"
    encoding: text_encoding

processors:
  batch: {}
  resource/env:
    attributes:
      - key: deployment.environment
        value: "local"
        action: upsert

exporters:
  debug:
    use_internal_logger: false
    verbosity: detailed

service:
  extensions: [text_encoding]
  pipelines:
    logs:
      receivers: [googlecloudpubsub]
      processors: [batch, resource/env]
      exporters: [debug]

Log output

Snippets, since these are large:


Logs	{"resource logs": 2, "log records": 2}
ResourceLog #0
Resource SchemaURL:
Resource attributes:
     -> deployment.environment: Str(local)
ScopeLogs #0
ScopeLogs SchemaURL:
InstrumentationScope
LogRecord #0
ObservedTimestamp: 2025-09-18 21:35:37.781243625 +0000 UTC
Timestamp: 1970-01-01 00:00:00 +0000 UTC
SeverityText:
SeverityNumber: Unspecified(0)
Body: Str({
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Execution fake-job-79gc8 has completed successfully."
    },
    "serviceName": "run.googleapis.com",
    "methodName": "/Jobs.RunJob",
    "resourceName": "namespaces/fake-project/executions/fake-job-79gc8",
    "response": {
      "metadata": {
        "name": "fake-job-79gc8",
        "namespace": "845684099668",
---TRUNCATED----



ResourceLog #1
Resource SchemaURL:
Resource attributes:
     -> deployment.environment: Str(local)
ScopeLogs #0
ScopeLogs SchemaURL:
InstrumentationScope
LogRecord #0
ObservedTimestamp: 2025-09-18 21:35:37.781243625 +0000 UTC
Timestamp: 1970-01-01 00:00:00 +0000 UTC
SeverityText:
SeverityNumber: Unspecified(0)
Body: Str(
                  {
                    "name": "VAR_11",
                    "value": "VAR_11"
                  },
                  {
                    "name": "VAR_12",
                    "value": "VAR_12"
                  },
                  {
                    "name": "VAR_13",
                    "value": "VAR_13"
                  },
                  {
                    "name": "VAR_14",
                    "value": "VAR_14"
                  },
                  {
                    "name": "VAR_15",
                    "value": "VAR_15"
                  }
                ],

Additional context

No response
