Comparing code completion LLMs — GitHub Copilot, Refact.ai, and Fitten Code


TL;DR: Code completion LLMs should be both brave and smart. If you are looking for a free alternative to Copilot, try Fitten Code.

Ming

We all love GitHub Copilot, but it comes with a price tag. Two alternatives are Refact.ai and Fitten Code, both offering free tiers. How do they compare? Who generates the best snippets, and who runs the fastest?

In this article, I will compare these three AI-powered code generation tools across three use cases. Let’s see how they line up.

How good are their code generations?

A good code completion model should be both brave and smart. Here’s what I mean:

  • Brave: When information is inadequate for making helpful suggestions, it should have the guts to make assumptions and to leave holes for the human to fill in.
  • Smart: It should be syntactically correct and semantically relevant.

To evaluate these attributes, some good questions are:

  1. Can the model infer usages from declarations?
  2. Can the model understand product requirements?
  3. How creatively can the model improvise?

For each of the questions, I designed a problem for the LLM to solve:

  1. Given an OOP representation of real-world entities, can the LLM deduce how they interact with each other?
  2. Given an English description of a data analysis exercise, to what degree can the LLM fulfill the requirements?
  3. When asked to draft a learning plan, does the LLM offer tasks that are incrementally challenging in difficulty and comprehensive in topics?

Now that we are aligned on the evaluation methodology, we shall continue to put our competitors to the test. I’ll be using PyCharm 2024.1.1 CE with only one plugin enabled at a time.

Understanding OOP models of real-life concepts

Daunted by long chains of suffixes like ...DelegateFactoryHelper in the object-oriented programming (OOP) world? Consider yourself lucky that you don’t work in molecular biology. While it would be a war crime to introduce a ...UtilsUtilsUtils class to your source code, in the microscopic world terminology can go crazy. One example is “Mitogen-Activated Protein (MAP) kinase kinase kinase” (MAP3K).

To paraphrase Robbins & Cotran Pathologic Basis of Disease, MAP3K is meant for activating “MAP kinase kinase” (MAP2K) via a process called “phosphorylation”, which in turn is meant for activating “MAP kinase” (MAPK) with — again — phosphorylation. As a protein kinase itself, MAPK is meant for phosphorylating (“adding a phosphate group to”) some other proteins. This whole cascade of phosphorylation is an important signaling pathway for cells.


(Diagram retrieved via Wikimedia Commons. No, I don’t understand most of it either.)

If OOP is such a good programming paradigm, can it be used to express the intermolecular relationships around MAP3K clearly and concisely, to the point that code completion models can make good use of them?

It didn’t take me too long to translate the entities and capabilities mentioned above into Python:

class Protein:
    """A macromolecular entity consisting of a sequence of amino acids and associated secondary structures and functional domains."""

class Enzyme(Protein):
    """A protein that catalyzes chemical reactions."""

class Kinase(Enzyme):
    """Something that adds a phosphate group to a target ("substrate")."""
    def phosphorylate(self, substrate):
        """Adds a phosphate group to the target ("substrate")."""

class ProteinKinase(Kinase):
    """A kinase that phosphorylates proteins."""
    def phosphorylate(self, substrate: Protein):
        """Adds a phosphate group to the target ("substrate") protein."""

class MitogenActivatedProteinKinase(ProteinKinase):
    """a.k.a. MAPK or MAP kinase."""

class MitogenActivatedProteinKinaseKinase(ProteinKinase):
    def activate(self, target: MitogenActivatedProteinKinase):
        self.phosphorylate(target)

class MitogenActivatedProteinKinaseKinaseKinase(ProteinKinase):
    def activate(self, target: MitogenActivatedProteinKinaseKinase):
        self.phosphorylate(target)

Let’s see how our models stack up in making use of these classes and methods. I nudged them into action by adding if __name__ == '__main__' (what is this?), and here’s what I got.

GitHub Copilot captured the cascade of MAP3K->MAP2K->MAPK as I expected:

if __name__ == '__main__':
    mapkkk = MitogenActivatedProteinKinaseKinaseKinase()
    mapkk = MitogenActivatedProteinKinaseKinase()
    mapk = MitogenActivatedProteinKinase()

    mapkkk.activate(mapkk)
    mapkk.activate(mapk)
    print('done')

Refact.ai chose to initialize each of the defined classes without invoking any of the methods. In the process, it confused MAP3K with an activated kinase:

if __name__ == '__main__':
    # 1. Create a protein
    protein = Protein()
    # 2. Create a kinase
    kinase = MitogenActivatedProteinKinase()
    # 3. Create a phosphorylated protein
    phosphorylated_protein = ProteinKinase()
    # 4. Create a phosphorylated kinase
    phosphorylated_kinase = MitogenActivatedProteinKinaseKinase()
    # 5. Create a phosphorylated kinase kinase
    phosphorylated_kinase_kinase = MitogenActivatedProteinKinaseKinaseKinase()

Fitten Code surprised me here. It chose to focus on the description of the Protein class and improvised some attribute initializations and inspections:

if __name__ == '__main__':
    # Example usage
    mammalian_target = Protein()
    mammalian_target.sequence = "MA"
    mammalian_target.secondary_structure = "alpha-helix"
    mammalian_target.functional_domains = ["nucleotide binding", "transcription factor"]
    # Create a MAPK-like kinase
    mapk = MitogenActivatedProteinKinase()
    # Activate the kinase on the target
    mapk.activate(mammalian_target)
    # Check the result
    print(mammalian_target.sequence)  # "MAK"
    print(mammalian_target.secondary_structure)  # "alpha-helix"
    print(mammalian_target.functional_domains)  # ["nucleotide binding", "transcription factor", "phosphorylation"]

How did the three models compare in terms of writing comments?

  • GitHub Copilot decided to be taciturn. It wrote no comments and only printed a concise “done” at the end.
  • Refact.ai commented on every line of code it wrote. However, the comments are mere English translations of the code, with no mention of intent. I’m an advocate of the idea that Code Tells You How, Comments Tell You Why, so I find these comments redundant.
  • Fitten Code kicked it up a notch. Not only did it comment on what it was trying to do (“let me show you an example of how to use the code”, and “here let’s check the result”), but it also wrote down the expected output alongside each effect.

How brave and smart were the three models?

  • GitHub Copilot’s generation is syntactically valid and semantically correct. It did everything I expected an AI to do and nothing more. We can say it’s smart but not so brave.
  • Refact.ai wrote code that runs but is semantically wrong. Furthermore, it shied away from invoking any methods, so it’s quite timid.
  • Fitten Code struck a good balance between correctness and helpfulness. Although I did not implement the phosphorylate methods, Fitten Code inferred from the docstrings that they may carry a side effect along the lines of substrate.functional_domains.append("phosphorylation"). To accommodate this side effect, it improvised data structures for the Protein class. If I were developing software from scratch, some auto-generated proof-of-concept code like this would be delightful. It was brave and smart.

Fitten Code is definitely the winner here. Does the code run? No (it called mapk.activate where only phosphorylate exists). Am I happy to see the generated code? Absolutely yes.
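
For the curious, here is a minimal, hypothetical fill-in (my own, not any model’s output, with the intermediate Enzyme and Kinase classes collapsed for brevity) that realizes the side effect Fitten Code guessed at, plus an activate alias so its snippet would actually run:

class Protein:
    def __init__(self):
        # Data structures inferred from Fitten Code's usage above.
        self.sequence = ""
        self.secondary_structure = ""
        self.functional_domains = []

class ProteinKinase(Protein):
    def phosphorylate(self, substrate: "Protein"):
        # The assumed side effect: mark the substrate as phosphorylated.
        substrate.functional_domains.append("phosphorylation")

class MitogenActivatedProteinKinase(ProteinKinase):
    def activate(self, target: Protein):
        # Hypothetical alias, added only so mapk.activate(...) runs.
        self.phosphorylate(target)

if __name__ == '__main__':
    mammalian_target = Protein()
    mapk = MitogenActivatedProteinKinase()
    mapk.activate(mammalian_target)
    print(mammalian_target.functional_domains)  # ['phosphorylation']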

Cleaning and plotting data like an analyst

In an essay last year, I answered the question: How many software developers work in non-tech companies? It was a straightforward exercise in manipulating and visualizing the following dataset (via OES Query System):

(Screenshot of the dataset, via the OES Query System.)

Can our code generation models reproduce the ring chart I plotted in that post?

The requirement is short enough to be put as a module-level docstring in an empty Python file:

"""
## Plot data from a spreadsheet

The file `OES_Report.xlsx` contains just one sheet, "OES Sheet". It contains a table with the following columns:
- **"Occupation (SOC code)"**: Standard Occupational Classification code of industries. Two types of entry exist:
- **Sectors**, such as "Sector 21 - Mining(21--22)" and "Sector 22 - Utilities(22--23)".
- **Finer-grained fields of work related to software development**, such as "Software Publishers(513200)" and "Computer Systems Design and Related Services(541500)". They will always end with bracketed 6-digit numbers like "(123456)".
- **"Employment(1)"**: Number of employees in the industry. For example, one entry is "67710".
- **"Annual median wage(2)"**: Median wage of the industry in the US dollars. For example, one entry is "106040".

The task is to plot a chart of for employments related to software development. In the graph, show the percentages of software-related employments contributed by each sector of industry. Use a ring chart if you can; if you can not, you can resort to a pie chart.

Caveats:
- You need to skip the first 5 rows of the table, which contain the column headings.
- Footers exist, so you need to skip them upon reading as well.
- You will be judged on how pretty you can make the graph, so be artistic.
"""

I then gave the keyboard to each of our three models, hitting Tab and Return as needed. Whenever I saw plt.show() generated, I stopped and ran the script. Here's what I got.


GitHub Copilot plotted the percentage of employment contributed by each software-related field, not by each sector of industry as I intended:

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
df = pd.read_excel('OES_Report.xlsx', sheet_name='OES Sheet',
                   skiprows=5, skipfooter=5)

# Filter out software-related employments
df = df[df['Occupation (SOC code)'].str.contains(r'\(\d{6}\)')]
df['Employment(1)'] = df['Employment(1)'].str.replace(',', '').astype(int)

# Calculate the total employment in software-related fields
total_software_employment = df['Employment(1)'].sum()

# Calculate the percentage of employment contributed by each sector
df['Percentage'] = df['Employment(1)'] / total_software_employment * 100

# Plot the chart
fig, ax = plt.subplots()
ax.pie(df['Percentage'], labels=df['Occupation (SOC code)'],
       autopct='%1.1f%%', startangle=90)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

(Pie chart that GitHub Copilot plotted. Yes, the labels are cut off in the graph.)

Observations:

  • GitHub Copilot wrongly assumed that the footers span exactly 5 rows, as the headers do, though my prompt never said so. (A more defensive alternative is sketched after this list.)
  • There was a superfluous .str access in df['Employment(1)'].str.replace. This is understandable, because the prompt provided too little information about the data types. Still, Copilot could have inferred from the example "67710" that the data contains no thousands separators.
  • Re: “Equal aspect ratio ensures that pie is drawn as a circle” — I hadn’t even realized that was a concern. TIL.
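
On the footer caveat, a more defensive pattern (a sketch, assuming footer rows carry no numeric employment figures) is to coerce the column and drop whatever fails to parse, instead of guessing skipfooter=5:

import pandas as pd

# Read without assuming how long the footer is.
df = pd.read_excel('OES_Report.xlsx', sheet_name='OES Sheet', skiprows=5)

# Footer rows fail numeric conversion, become NaN, and get dropped.
df['Employment(1)'] = pd.to_numeric(df['Employment(1)'], errors='coerce')
df = df.dropna(subset=['Employment(1)'])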

On the contrary, Refact.ai decided to plot a pie chart of every entry in the dataset, regardless of whether it is a sector or a sub-sector field. That’s a faulty parallelism: the two kinds of rows sit at different levels of the hierarchy, so charting them side by side mixes wholes with their parts:

import matplotlib.pyplot as plt
import pandas as pd

# Read the data from the spreadsheet
df = pd.read_excel("OES_Report.xlsx", sheet_name="OES Sheet",
                   skiprows=6, header=None)

# Create a pie chart of the data
plt.pie(df[1], labels=df[0], autopct="%1.1f%%")
plt.title("Employments related to software development")
plt.show()

(Pie chart that Refact.ai plotted.)

Observations:

  • Refact.ai wrote the fewest lines of code. Considered together with its behavior in the biology problem, Refact.ai seems to prefer not calling methods.
  • It is the only model among the three that added a title to the plot. Kudos to it.
  • The header row was skipped, and all columns were referred to by their numerical IDs. That makes the code cryptic to me; I’d rather Refact.ai hadn’t done that. (A more legible alternative is sketched below.)
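
If the header row really must be skipped, a small follow-up assignment keeps the code legible. A sketch, with column names of my own choosing (the real headers were discarded) and assuming the three columns arrive in the order my prompt described:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_excel("OES_Report.xlsx", sheet_name="OES Sheet",
                   skiprows=6, header=None)
# Assumes exactly three columns, in the order described by the prompt.
df.columns = ["occupation", "employment", "median_wage"]

plt.pie(df["employment"], labels=df["occupation"], autopct="%1.1f%%")
plt.title("Employments related to software development")
plt.show()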

Again, Fitten Code generated the richest code snippet for our challenge:

import pandas as pd
import matplotlib.pyplot as plt

# Read data from the spreadsheet
df = pd.read_excel("OES_Report.xlsx", sheet_name="OES Sheet", skiprows=5)

# Filter data for software-related occupations
software_df = df[df["Occupation (SOC code)"].str.contains(
    r"\(5\d{3}\)|\(5\d{4}\)|\(5\d{5}\)")]

# Calculate the total number of software-related employments
total_software_employments = software_df["Employment(1)"].sum()

# Calculate the percentages of software-related employments contributed by each sector
sectors = software_df["Occupation (SOC code)"].str.split(
    r"\(", expand=True)[0].unique()
sector_software_employments = {}
for sector in sectors:
    sector_software_employments[sector] = software_df[
        software_df["Occupation (SOC code)"].str.startswith(sector)
    ]["Employment(1)"].sum()
sector_software_percentages = {
    sector: sector_software_employments[sector] / total_software_employments
    for sector in sectors}

# Plot the data
fig, ax = plt.subplots(figsize=(8, 8))

# Ring chart
if len(sectors) <= 10:
    ax.pie(sector_software_percentages.values(),
           labels=sector_software_percentages.keys(),
           autopct='%1.1f%%', startangle=90)
    ax.axis('equal')
# Pie chart
else:
    ax.pie(sector_software_percentages.values(),
           labels=sector_software_percentages.keys(),
           autopct='%1.1f%%', startangle=90)

plt.show()

(Chart that Fitten Code plotted.)

Observations:

  • In filtering data for software-related occupations, Fitten Code looked for SOC IDs that start with “5” and are four to six digits long. It seems to have inferred from my two examples that all software-related SOC IDs begin with “5”, which isn’t technically true, but I hold myself responsible for not providing contrasting examples in my prompt.
  • Fitten Code is the only model that attempted to crunch the numbers as I instructed. Curiously, it did not understand numerical ranges expressed like (00--99); instead, it chose to match sectors and fields by their names. Once again, that’s a bold assumption that may delightfully surprise its users.
  • Similarly, Fitten Code is the only model that at least tried to plot a ring chart, though the only effort it devoted was an if statement and an ax.axis('equal').
  • It was also inventive in making up the criterion (“if there are more than 10 sectors to draw”) by which to switch from a ring chart to a pie. Although I doubt a pie chart would be any more readable than a ring, this is nonetheless good skeleton code for inspired programmers to fill out with a more sensible implementation. (For a ring chart that actually has a hole, see the sketch below.)
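
For reference, Matplotlib can draw a genuine ring (donut) chart with a single extra keyword. A minimal sketch with made-up numbers:

import matplotlib.pyplot as plt

sizes = [55, 25, 20]  # made-up sector percentages
labels = ["Sector A", "Sector B", "Sector C"]

fig, ax = plt.subplots(figsize=(6, 6))
# A wedge width below 1 hollows out the center, turning the pie into a ring.
ax.pie(sizes, labels=labels, autopct="%1.1f%%", startangle=90,
       wedgeprops=dict(width=0.4))
ax.axis("equal")
plt.show()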

Again, we see that Fitten Code is the bravest-and-smartest candidate of them all. This makes me wonder: would Fitten Code project (in the psychological sense) its boldness and intelligence onto me, the human user?

Making project roadmaps

We’ve tried generating code from code prompts and generating code from natural-language instructions. For the dessert in our three-course dinner, let’s try generating natural-language content from a natural-language prompt.

Sitting in my codebase is this README file:

# Rusty

My Rust-lang playground. This program serves no practical purpose. It is only meant to help me learn how to code in Rust.

(“I am Ferris the crab, unofficial mascot for Rust.” Retrieved from rustacean.net.)

What does each code completion model think would be good starter tasks for a future Rustacean? Let’s see what they have to say about the following prompt (taking the first six entries from each):

## Plan  

Here are some programming tasks I hope to complete with this project:

GitHub Copilot thought I should stick to the basics and make sure that I know how to use print well. It also seems so confident in me that it marked all the to-dos as done already:

- [x] Create a new project
- [x] Print "Hello, world!"
- [x] Print a number
- [x] Print a string
- [x] Print a boolean
- [x] Print a tuple

Refact.ai wanted me to go further and make something actually useful, though I doubt anyone would appreciate a calculator asking for internet access:

- [ ] Write a simple calculator
  - [ ] Add basic arithmetic operations
  - [ ] Add basic string operations
  - [ ] Add basic file operations
  - [ ] Add basic network operations
  - [ ] Add basic database operations

Fitten Code had more ambitious plans for me:

- [ ] Implement a basic calculator in Rust  
- [ ] Implement a basic game in Rust
- [ ] Implement a basic HTTP server in Rust
- [ ] Implement a basic database in Rust
- [ ] Implement a basic machine learning algorithm in Rust
- [ ] Implement a basic neural network in Rust

Put side-by-side:

  • GitHub Copilot didn’t even try. The learning plan it built was neither incrementally challenging nor comprehensive.
  • Refact.ai offered a plan that steps up in difficulty and thoroughness, but the final product is just bizarre. Am I going to end up with a calculator with ray tracing?


“A screenshot of a calculator app with ray tracing capabilities”. Generated with DALL·E 3 via Microsoft Copilot.
  • If the areas that Refact.ai covered are as pedagogical as a CS101 textbook from the 2000s, Fitten Code took us to the next decade. I like how it squeezed in trendy topics like ML and neural networks.

I don’t know where I earned this much respect from Fitten Code, but I’m humbled to see such an ambitious roadmap.

How fast do they generate?

Code completion tools are only as good as their ability to outrun your brainwork and keystrokes. I recorded the editor window while generating code for the data visualization problem above; all recordings play at their original speed. Here’s what I observed:

GitHub Copilot generated 22 lines over 34 seconds (0.65 lines/sec). A caveat: on lines that Copilot wanted to leave blank, it would not suggest snippets beyond the empty line (while the other two plugins do). This confused me, as I couldn’t tell whether it was waiting for the server to respond or simply had no more suggestions.

Refact.ai gave me 16 lines over 24 seconds, or 0.67 lines/sec. You can see that I wasted a couple of seconds at the beginning waiting for the autocompletion to kick in. This is the only model where I had to trigger the generation by typing import manually.

Fitten Code yielded 34 lines over 30 seconds, a whopping 1.13 lines/sec. That’s almost double the speed of the other two contestants.

In summary, Fitten Code beat the other two in terms of speed. I also enjoyed its eagerness to provide suggestions: I didn’t have to type the first few characters (unlike with Refact.ai) or guess whether it was stuck or just had nothing more to say (in contrast to Copilot).

Discussion

Before my employer gave me free access to GitHub Copilot (even for my hobbyist projects), I had been a happy user of Refact.ai. However, the new kid on the block, Fitten Code, has stolen my attention. It’s free (at least for now), faster, and more helpful.

Let me reemphasize why I define helpfulness as a tradeoff between braveness and smartness. When I use an LLM-powered code completion tool, I expect things to be inventive and slightly wrong:

  • I want it to be inventive, because many code patterns would take me a while to recall from memory / look up online.
  • I’m going to take another pass at the code anyway, so I can afford to correct some minor issues.
  • But if there are so many mistakes that I effectively have to rewrite from scratch, the inspecting and undoing would simply be a waste of my time.

Looking across the whole landscape of code completion tools: if I cared above all about correctness (every suggested snippet must be syntactically correct and semantically helpful), I would have stuck with traditional solutions based on abstract syntax trees (ASTs).
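
To illustrate the contrast, syntax-tree tooling only proposes names it can prove exist. A toy sketch of that kind of grounding, using Python’s standard ast module (not modeled on any particular product):

import ast

source = """
class Kinase:
    def phosphorylate(self, substrate): ...
"""

# Walk the parsed tree and collect method names that verifiably exist.
tree = ast.parse(source)
methods = [node.name for node in ast.walk(tree)
           if isinstance(node, ast.FunctionDef)]
print(methods)  # ['phosphorylate']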

With this guiding principle in mind, I conclude this benchmark with a preference for Fitten Code. Of course, my comparisons are biased by the nature of the problems, the languages I chose to use, and maybe even my network environment. Though your mileage may vary, if you are casually browsing for a more performant alternative to GitHub Copilot, why not give Fitten Code a try?