Skip to content

Conversation

amjith
Copy link
Member

@amjith amjith commented Mar 14, 2021

Description

Fixes #89

Turns out sqlite3 library for Python uses utf-8 by default which works fine since Sqlite3 stores everything as utf-8. But as you pointed out there could be invalid unicode values that can sneak in. Thankfully the python library allows overriding of the decoder that can be used. So I've caught the exception and applied latin-1 decoding.

Checklist

  • I've added this contribution to the CHANGELOG.md file.

@amjith amjith requested review from j-bennet and zzl0 March 14, 2021 20:14
@codecov-io
Copy link

Codecov Report

Merging #113 (b71fb3d) into master (0baeadf) will increase coverage by 0.69%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #113      +/-   ##
==========================================
+ Coverage   62.34%   63.04%   +0.69%     
==========================================
  Files          23       23              
  Lines        1936     1986      +50     
==========================================
+ Hits         1207     1252      +45     
- Misses        729      734       +5     
Impacted Files Coverage Δ
litecli/main.py 48.60% <ø> (+0.46%) ⬆️
litecli/packages/parseutils.py 96.87% <ø> (ø)
litecli/packages/special/iocommands.py 54.51% <ø> (+3.73%) ⬆️
litecli/sqlexecute.py 70.40% <100.00%> (+1.49%) ⬆️
litecli/key_bindings.py 14.58% <0.00%> (-2.09%) ⬇️
litecli/clibuffer.py 33.33% <0.00%> (-1.97%) ⬇️
litecli/packages/special/main.py 88.23% <0.00%> (+0.93%) ⬆️
litecli/completion_refresher.py 76.56% <0.00%> (+1.98%) ⬆️
litecli/packages/special/dbcommands.py 38.62% <0.00%> (+2.19%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1a01c1...b71fb3d. Read the comment docs.

Copy link
Contributor

@zzl0 zzl0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zzl0 zzl0 merged commit 04c15ed into master Mar 15, 2021
@zzl0
Copy link
Contributor

zzl0 commented Mar 15, 2021

Update: this example shows the simplified version is better (in my opionion):

>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('utf-8', 'backslashreplace')
'😊\\x80abc'
>>> b'\xf0\x9f\x98\x8a\x80abc'.decode('latin-1')
\x9f\x98\x8a\x80abc'

Just realized the utf8_resilient_decoder function could be simplified, e.g.:

def utf8_resilient_decoder(b):
    return b.decode("utf-8", "backslashreplace")

Below examples are from https://docs.python.org/3/howto/unicode.html#the-string-type

>>> b'\x80abc'.decode("utf-8", "strict")  
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

@amjith amjith deleted the utf-8 branch March 15, 2021 13:44
@amjith
Copy link
Member Author

amjith commented Mar 15, 2021

Yup. I like your solution better. I'll change the decoder to use the backslashreplace.

Thanks for the feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TabularOutputFormatter chokes on values that can't be converted to UTF-8 if it's a text column

3 participants