Show HN: Alphareader: a custom separator and endline file reader in Python
(github.com)

Some tweaks:
#1:

    elif not isinstance(fn_transform, FunctionType) or not isinstance(fn_transform, LambdaType):
        raise TypeError('Transformation parameter should be a function or lambda i.e. fn = lambda x: x.replace(a,b)')
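For illustration, a hypothetical functor (an object defining __call__) fails that isinstance test even though it works fine as a transformation:

```python
from types import FunctionType, LambdaType

class Upper:
    """Hypothetical functor: not a FunctionType, but perfectly callable."""
    def __call__(self, text):
        return text.upper()

fn = Upper()
print(isinstance(fn, (FunctionType, LambdaType)))  # False -- rejected by the check
print(callable(fn))                                # True -- accepted by callable()
print(fn("abc"))                                   # ABC
```

(Incidentally, types.LambdaType is just an alias of FunctionType, so the two isinstance tests in the quoted line are identical anyway.)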
What about using callable()? There's no reason you couldn't use a functor, for example.

#2:
    curr = file_handle.read(chunk_size)
    if encoding:
        curr = curr.decode(encoding)
That assumes a single-byte encoding. Consider a multi-byte encoding where the chunk_size reads only part of the character:

    >>> s = "ü"
    >>> s.encode("utf8")[:1].decode("utf8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
#3:

    if chr(terminator) in chunk:
        lines = chunk.split(chr(terminator))

Might want to compute chr(terminator) once, rather than re-evaluate it each time.

#4:
    try:
        transformations = iter(fn_transform)
        yield list(map(lambda x: reduce(lambda a,b: b(a), fn_transform, x), columns))
    except TypeError:
        yield list(map(fn_transform, columns))
Since you've already checked for the two cases, set a flag to remember what fn_transform contains, then branch on that rather than use the try/except. Otherwise, consider what happens if one of the callables raises a TypeError because of an internal error, rather than because of an expected structural mismatch.
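A sketch of the flag idea, with illustrative names (apply_transforms and is_chain are not from the repo; in the real code the result would be yielded from the generator):

```python
from functools import reduce

def apply_transforms(fn_transform, columns):
    # Decide once what fn_transform contains; a TypeError raised inside a
    # transformation then propagates instead of being swallowed by the except.
    is_chain = not callable(fn_transform)  # a list/tuple of callables vs. one callable
    if is_chain:
        return [reduce(lambda acc, fn: fn(acc), fn_transform, col) for col in columns]
    return [fn_transform(col) for col in columns]

print(apply_transforms(str.upper, ["a", "b"]))                  # ['A', 'B']
print(apply_transforms([str.strip, str.upper], [" a ", " b "])) # ['A', 'B']
```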
Thank you @eesmith. Comments appreciated, and PRs to the repo as well. ;-) The multi-byte issue is a great catch! I made the wrong assumption of single-byte separators. Perhaps a library limitation, if we want to keep the logic simple. Ideas on the fix?
If it's a fixed-width encoding, nudge the read size to a multiple of that encoding size.
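For example (a sketch; align_chunk_size is an illustrative helper):

```python
def align_chunk_size(chunk_size, char_width):
    # Round the read size down to a whole number of fixed-width characters
    # (at least one), so a read can never split a character.
    return max(char_width, (chunk_size // char_width) * char_width)

print(align_chunk_size(4096, 4))  # 4096 -- e.g. UTF-32 is 4 bytes per character
print(align_chunk_size(1000, 3))  # 999
```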
If it's utf-8, keep the block reads in byte space, search for the terminator as a byte sequence, and only decode after you find the terminator.
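That could look something like this (a sketch, not the library's code; read_records and its parameters are illustrative). It relies on UTF-8 being self-synchronizing, so the terminator's byte sequence can't occur inside another character:

```python
import io

def read_records(file_handle, terminator="\n", encoding="utf-8", chunk_size=4096):
    # Keep the block reads in byte space and split on the terminator's byte
    # sequence; decode only complete records, so a chunk boundary falling
    # inside a multi-byte character can never break the decode.
    term_bytes = terminator.encode(encoding)
    buffer = b""
    while True:
        chunk = file_handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        *records, buffer = buffer.split(term_bytes)
        for record in records:
            yield record.decode(encoding)
    if buffer:
        yield buffer.decode(encoding)

# chunk_size=2 forces a read boundary in the middle of "ü" (2 bytes in UTF-8)
data = io.BytesIO("aü|b|c".encode("utf-8"))
print(list(read_records(data, terminator="|", chunk_size=2)))  # ['aü', 'b', 'c']
```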
Otherwise, throw your hands up in the air and give up?
Catch the UnicodeDecodeError, use err.start, and see if it's close to the end of the block? If it is, then do another read?
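That works; the standard library can also do this bookkeeping for you via codecs.getincrementaldecoder, which buffers a trailing partial character across chunks. A sketch (decode_stream is an illustrative name):

```python
import codecs

def decode_stream(byte_chunks, encoding="utf-8"):
    # The incremental decoder holds back a trailing partial character itself,
    # so there is no need to inspect err.start and re-read by hand.
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in byte_chunks:
        yield decoder.decode(chunk)
    yield decoder.decode(b"", final=True)

data = "ü".encode("utf-8")            # b'\xc3\xbc'
pieces = [data[:1], data[1:]]         # chunk boundary falls mid-character
print("".join(decode_stream(pieces))) # ü
```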
BTW, you can mitigate some Python overhead by using a larger read size.