UTF-8 Decoding Error #113

amjith · 2021-03-14T20:14:14Z

Description

Fixes #89

It turns out that Python's sqlite3 library decodes text as UTF-8 by default, which normally works fine because SQLite stores text as UTF-8 by default. But as you pointed out, invalid byte sequences can still sneak in. Thankfully, the Python library allows overriding the decoder it uses. So I catch the decoding exception and fall back to latin-1 decoding.
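For readers unfamiliar with the override mechanism, here is a minimal sketch of the approach described above. It assumes the decoder is installed through the connection's `text_factory` attribute; the function name `latin1_fallback_decoder` and the in-memory test query are illustrative only and are not taken from the PR's diff.

```python
import sqlite3


def latin1_fallback_decoder(raw: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, fall back to latin-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


conn = sqlite3.connect(":memory:")
# text_factory is called with the raw bytes of every TEXT value read.
conn.text_factory = latin1_fallback_decoder

# CAST(blob AS TEXT) yields a TEXT value holding the bytes as-is,
# including the invalid UTF-8 start byte 0x80.
row = conn.execute("SELECT CAST(X'80616263' AS TEXT)").fetchone()
value = row[0]
conn.close()
```

Because latin-1 never raises, the fallback always returns a string, but a value that is mostly valid UTF-8 gets decoded byte by byte once a single bad byte appears.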

Checklist

I've added this contribution to the CHANGELOG.md file.

codecov-io · 2021-03-14T20:15:05Z

Codecov Report

Merging #113 (b71fb3d) into master (0baeadf) will increase coverage by 0.69%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #113      +/-   ##
==========================================
+ Coverage   62.34%   63.04%   +0.69%     
==========================================
  Files          23       23              
  Lines        1936     1986      +50     
==========================================
+ Hits         1207     1252      +45     
- Misses        729      734       +5

Impacted Files	Coverage Δ
litecli/main.py	`48.60% <ø> (+0.46%)`	⬆️
litecli/packages/parseutils.py	`96.87% <ø> (ø)`
litecli/packages/special/iocommands.py	`54.51% <ø> (+3.73%)`	⬆️
litecli/sqlexecute.py	`70.40% <100.00%> (+1.49%)`	⬆️
litecli/key_bindings.py	`14.58% <0.00%> (-2.09%)`	⬇️
litecli/clibuffer.py	`33.33% <0.00%> (-1.97%)`	⬇️
litecli/packages/special/main.py	`88.23% <0.00%> (+0.93%)`	⬆️
litecli/completion_refresher.py	`76.56% <0.00%> (+1.98%)`	⬆️
litecli/packages/special/dbcommands.py	`38.62% <0.00%> (+2.19%)`	⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a1a01c1...b71fb3d.

zzl0

LGTM

zzl0 · 2021-03-15T03:12:04Z

Update: this example shows why the simplified version is better (in my opinion):

>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('utf-8', 'backslashreplace')
'😊\\x80abc'
>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('latin-1')
'ð\x9f\x98\x8a\x80abc'

Just realized the utf8_resilient_decoder function could be simplified, e.g.:

def utf8_resilient_decoder(b):
    return b.decode("utf-8", "backslashreplace")
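As a rough illustration of how the simplified decoder behaves end to end, the sketch below wires it into an in-memory database. The `text_factory` wiring and the test query are assumptions made for the example; only `utf8_resilient_decoder` itself comes from the comment above.

```python
import sqlite3


def utf8_resilient_decoder(b: bytes) -> str:
    # Invalid bytes become literal \xNN escapes; valid UTF-8 is kept intact.
    return b.decode("utf-8", "backslashreplace")


conn = sqlite3.connect(":memory:")
conn.text_factory = utf8_resilient_decoder  # assumed wiring for this sketch

# A valid 4-byte emoji followed by a stray 0x80 byte and "abc".
row = conn.execute("SELECT CAST(X'F09F988A80616263' AS TEXT)").fetchone()
decoded = row[0]
conn.close()
```

Here the emoji survives and only the stray byte is escaped, whereas a latin-1 fallback would turn the emoji's four bytes into four unrelated characters.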

The examples below are from https://docs.python.org/3/howto/unicode.html#the-string-type

>>> b'\x80abc'.decode("utf-8", "strict")  
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

amjith · 2021-03-15T13:45:01Z

Yup. I like your solution better. I'll change the decoder to use backslashreplace.

Thanks for the feedback.

amjith added 3 commits March 14, 2021 13:09

Add a test to reproduce the invalid utf-8 error.

c2f914a

Fix the utf-8 decoding error.

5b86afa

Update changelog.

b71fb3d

amjith requested review from j-bennet and zzl0 March 14, 2021 20:14

amjith mentioned this pull request Mar 14, 2021

TabularOutputFormatter chokes on values that can't be converted to UTF-8 if it's a text column #89

Closed

zzl0 approved these changes Mar 15, 2021

zzl0 merged commit 04c15ed into master Mar 15, 2021

amjith deleted the utf-8 branch March 15, 2021 13:44

dependabot bot mentioned this pull request Mar 16, 2021

Bump litecli from 1.5.0 to 1.6.0 opensafely-core/opencodelists#580

Merged
