An interesting paper was published in Nature two days ago, titled “Remote collaboration fuses fewer breakthrough ideas”. Here is the abstract:
Theories of innovation emphasize the role of social networks and teams as facilitators of breakthrough discoveries. Around the world, scientists and inventors are more plentiful and interconnected today than ever before. However, although there are more people making discoveries, and more ideas that can be reconfigured in new ways, research suggests that new ideas are getting harder to find — contradicting recombinant growth theory. Here we shed light on this apparent puzzle. Analysing 20 million research articles and 4 million patent applications from across the globe over the past half-century, we begin by documenting the rise of remote collaboration across cities, underlining the growing interconnectedness of scientists and inventors globally. We further show that across all fields, periods and team sizes, researchers in these remote teams are consistently less likely to make breakthrough discoveries relative to their on-site counterparts. Creating a dataset that allows us to explore the division of labour in knowledge production within teams and across space, we find that among distributed team members, collaboration centres on late-stage, technical tasks involving more codified knowledge. Yet they are less likely to join forces in conceptual tasks—such as conceiving new ideas and designing research—when knowledge is tacit. We conclude that despite striking improvements in digital technology in recent years, remote teams are less likely to integrate the knowledge of their members to produce new, disruptive ideas.
This is a classic big-if-true story that Nature loves to publish because it’s catchy, it’s topical, and this is a real subject worth studying that has ramifications for pretty much everyone - scientists and non-scientists alike. To note the importance of this article, this brief commentary was also published alongside the original paper. Its title is even more inflammatory than the original paper’s: “‘Disruptive’ science more likely from teams who work in the same place”. And naturally, various science journalism outlets are reporting on this paper and quoting the authors uncritically.
Given that a lot of the world started working remotely due to the COVID-19 pandemic, and return-to-office policies are coming back in full swing for a variety of reasons, let’s explore this paper a bit and see what it is about.
The disruption score, D
In this article, the authors define a score, \(D\), for a given work, which is defined as follows:
\[D = \frac{n_i - n_j}{n_i + n_j + n_k}\]where \(n_i\) is the number of derivative works that only cite the work in question, \(n_j\) is the number of derivative works that cite both the given work and its references, and \(n_k\) is the number of derivative works that only cite the given work’s predecessors that it itself cites. Here’s a schematic to explain what this looks like, where arrows between papers correspond to citations.

This \(D\) score is the main response variable of this paper, so it deserves some scrutiny.
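To make the definition concrete, here is a small helper (my own sketch, not the authors’ code) that computes \(D\) from the three counts; all the example counts are invented for illustration:

```python
def disruption_score(n_i, n_j, n_k):
    """Compute D = (n_i - n_j) / (n_i + n_j + n_k).

    n_i: derivative works citing only the focal work
    n_j: derivative works citing the focal work and its references
    n_k: derivative works citing only the focal work's references
    Returns None for a zero denominator (an uncited work).
    """
    total = n_i + n_j + n_k
    if total == 0:
        return None
    return (n_i - n_j) / total

print(disruption_score(10, 0, 0))  # 1.0: every citation ignores the references
print(disruption_score(0, 10, 0))  # -1.0: every citation also cites the references
print(disruption_score(3, 2, 5))   # 0.1: a mixed, hypothetical case
```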
The distribution of D
When a paper introduces some new metric or statistic, it is worthwhile to think about where this numerical quantity comes from and what its distribution might be like. Clearly, by definition \(D \in [-1, 1]\). Similarly, \(D\) will be 1 if and only if \(n_j = 0 = n_k\) (i.e. all derivative works cite only this work and none of its predecessors) and \(D\) will be -1 if and only if \(n_i = 0 = n_k\) (i.e. whenever this work is cited, at least one of its predecessors is always cited, and vice versa).
What about a paper that is not cited at all? By definition, \(D\) is undefined, and that is a problem, as we’ll see below.
Similarly, a paper with a single citation can have \(D = -1\), \(D = 1\), or a value arbitrarily close to 0 (as \(n_k\) grows), depending on the type of citation. That, too, is a problem, as we’ll see below.
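A quick sketch of that sensitivity, with invented counts: for a single citation, the sign of \(D\) flips entirely with the citation type, and predecessor-only citations (\(n_k\)) can push it toward 0:

```python
def d_score(n_i, n_j, n_k):
    # D = (n_i - n_j) / (n_i + n_j + n_k); a single citation means n_i + n_j = 1
    return (n_i - n_j) / (n_i + n_j + n_k)

print(d_score(1, 0, 0))  # 1.0: the lone citation ignores the predecessors
print(d_score(0, 1, 0))  # -1.0: the lone citation also cites a predecessor
print(d_score(1, 0, 9))  # 0.1: predecessor-only citations drag D toward 0
```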
Already, we can begin to see that there is a breakdown in the intuition of what a “disruptive” paper is and the numerical value that \(D\) has. The intuition behind defining \(D\) this way is useful, but one must always be careful not to confuse mathematical definitions with objective facts. This is important for interpreting the results of the paper.
D is a heuristic, not an intrinsic property
What are some extreme examples of works included in their dataset that can help us understand what the value of \(D\) might mean? The authors use Watson and Crick’s 1953 paper on the double-helix structure of DNA as a positive control. \(D = 0.96\) in this case, which seems apt.
But what about something like a textbook? Textbooks aggregate lots of information together to provide readers with understandable and comprehensive information to refer back to. Good textbooks are often cited in introductions or when describing important topics to an audience unfamiliar with the field, which is useful for journals with big general audiences, like Nature. For a paper that cites the textbook, unless it specifically focuses on a method or result that the textbook covers, the textbook’s predecessors are unlikely to be cited, too. I would suspect that textbooks would have high \(D\) scores, but I doubt anyone would define textbooks as “disruptive”.
Similarly, what about retracted studies? Retracted papers like Wakefield’s widely debunked paper attempting to link childhood vaccination with autism attract enormous attention from rebuttals and from case studies on misinformation, while hardly anyone cites their predecessors. This can lead to a very high \(D\) score, but I do not think that retracted and blatantly false papers should be thought of as “disruptive” - at least in the context this paper is concerned with.
What would a “developing” work look like? As a negative control, the authors cite the initial draft of the sequenced human genome. Now, we can debate how important different papers are, but this paper is a seminal work that helped birth or transform entire fields of study. I, personally, don’t think that’s a great example of a paper that isn’t “disruptive”. Notably, this paper cites Gregor Mendel’s work on heredity, one of the all-time greats of biological research. People who know nothing about genetics probably know about this work, leading to citations outside biology. But does that mean that sequencing the human genome is less “disruptive” because it merely cites more influential work?
Similarly, does \(D\) change over time? As time passes and papers garner more citations, \(D\) will, by definition, change value. As we’ll see below, this relationship to time ends up being crucial.
The lesson here is that we can come up with examples where the value of \(D\) doesn’t match with the interpretation we have of “disruption”, “development”, “innovation”, or any other topic the paper discusses. \(D\) is a heuristic for disruption, not an intrinsic property of the works. The authors could give examples of where and how this interpretation fails, but they do not. That is a red flag about the veracity of this paper, in my opinion.
But there are more concerns about this paper that are also worth discussing.
Data concerns
Lesson Number 1 I learned from my supervisor in grad school is “look at your data”. Unfortunately, the paper and its Extended Data Figures don’t actually show the original data; they only show averages and aggregations of it. So what does the original data in this study look like?
Undefined scores in the data
The data for this paper is published with the manuscript, which is excellent; every paper without privacy concerns should do the same. So we can inspect what data the authors are working with. Let’s return to the issue of a paper having no citations.
> citations[is.nan(d_score), .N]
[1] 5649477
> citations[is.na(d_score), .N]
[1] 5649477
Of the original 20 273 444 papers, 5 649 477 (28%) have an undefined \(D\) score. This likely comes from dividing by 0, which can happen for non-cited papers, as well as papers that have incomplete or missing citation information. In the paper, there is one mention of the word “missing” and it refers to building “machine learning models that effectively infer team roles for papers with implicit author contributions”, and not the \(D\) scores themselves. There are no mentions of incomplete data or division errors, either.
Almost every paper has a near-zero score
How about the observed distribution itself?

> quantile(
+   abs(citations$d_score)
+   , c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)
+   , na.rm = TRUE
+ )
          1%           5%          10%          25%          50%          75% 
0.0000000000 0.0000000000 0.0000517633 0.0003255208 0.0014695077 0.0056657224 
         90%          95%          99% 
0.0185089974 0.0393398519 0.2000000000
Almost every single paper used as a data point has \(D \approx 0\), which the authors describe as having a balanced “disruptive and developmental character”. Said another way, 14 333 693 of all papers in this study have \(\left| D \right| < 0.2\). This leaves 290 274 papers (< 1%) with \(D \in \left[-1, -0.2 \right]\) or \(D \in \left[0.2, 1 \right]\).
Remember how Watson and Crick’s paper was described as within the top 1%? Any paper with \(D \ge 0.2\) is in the top 1%, not just Watson and Crick. A paper with 5 total citations where 3 derivative works only cite the paper in question and 2 others cite both the paper and one of its predecessors has \(D = 0.2\). While “disruption” is subjective and context-dependent, I think there are many reasons to doubt that a paper with 5 citations in total should ever be considered within the top 1% “most disruptive”.
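The arithmetic for that hypothetical five-citation paper is easy to verify:

```python
n_i, n_j, n_k = 3, 2, 0  # 3 cite only the paper; 2 cite it plus a predecessor
d = (n_i - n_j) / (n_i + n_j + n_k)
print(d)  # 0.2, i.e. within the top 1% of defined scores
```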
Now, ~ 290 000 papers is not nothing. There still could be important insights gleaned from this many papers. But this is an extremely dangerous statistical scenario to find yourself in:
- you have lots of missing data that isn’t explained,
- most of the variance in your data is driven by < 1% of your observations,
- the sample size is really large, so nearly any null hypothesis statistical test you run on any stratification of this data is going to produce “statistically significant” \(p\)-values, even if the effect size is negligible
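The third point is worth demonstrating. In this simulation (my own, not the paper’s data), two groups differ by a negligible 0.01 standard deviations, yet a t-test with a million total observations declares the difference highly significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500_000
# Two groups whose true means differ by a negligible 0.01 standard deviations.
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.01, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)  # far below 0.05, despite a practically meaningless effect size
```

Statistical significance here says nothing about practical significance; it is purely a consequence of the enormous sample size.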
It is difficult to make sense of anything in scenarios like this unless you really understand the distribution of the response variable and the data generating process that produced it. But the authors do not mention any exploration of this data in the paper and there is no supplementary note nor expanded methods section that discusses this, either.
Methodological issues
The authors do the right thing by examining alternative explanations for the “reduced disruption of remote teams”. However, there are a number of clear mistakes in methodology here.
Using proxies without further analysis
Firstly, they use proxies everywhere and don’t explain how good these proxies are or how they may relate to one another.
We calculate the geographic distance between co-authors using the geographic coordinates of cities instead of institutions, so that we identify team members from the same city as on-site, regardless of city size and the distance between institutions within the city.
There are real and important differences between “on-site” and “in the same city” that are glossed over when conflating these two ideas. Some remote collaborators are extremely easy to get a hold of. Some collaborators working in the same institution are extremely hard to track down. Simply co-locating individuals based on the cities in which their primary affiliations are located is a broad brush that does not allow for any nuance that actually matters when discussing team dynamics and working relationships.
For each team, we calculate the average time-zone difference between all pairs of team members as a proxy for the underlying temporal separation.
It is true that time-zone separation is a communication hurdle that remote teams must navigate. But time zones are extremely arbitrary and run from the North Pole to the South Pole. Toronto, Montreal, and Bogotá all share a time zone. Certainly, research that bridges any two of those cities will have additional, and more influential factors that affect how research progresses, no? Things like linguistic differences, funding sources, lab infrastructures - how do you capture all of those important factors in a single “temporal separation” quantity?
We calculate the interdisciplinarity of team members and use it as a proxy for the diversity of knowledge to which the team has access.
Research today tends to be much more interdisciplinary than research of the past. Remote work is much more common now than in the past. This means that remote work and interdisciplinarity are confounded by time, which also confounds citation counts, and thus the relationship between remote work and \(D\).
Misinterpreting non-significant effects with insufficient explanation
The authors interpret non-significant \(p\)-values or an unchanged coefficient sign as evidence that alternative explanations cannot account for the effect of remote work. But the Extended Data Figures and Tables do not contain this information. The authors don’t evaluate any statistical model coefficients; they simply note the negative slope in their figures, and they do it multiple times.
We then include the constructed variable [team member interdisciplinarity] in our regression models and find that the negative impact of remote teams on disruption remains intact (Extended Data Tables 1 and 2). We conclude that differences in team heterogeneity are unlikely to explain the observed difference between on-site and remote teams.
However, when we include career age in our regression analysis, the negative impact of remote teams on disruption remains unchanged. We conclude that the age structure of remote and on-site teams cannot account for our key findings.
Second, to further mitigate concerns over selection bias, we run author fixed effects regressions and confirm that the negative impact of remote teams is still statistically significant (Extended Data Tables 1 and 2), although the magnitude of the coefficient is reduced, possibly because less disruptive scholars end up at more marginal universities, where they benefit more from the opportunities for remote collaboration. We conclude that selection or individual differences cannot fully explain the observed difference between on-site and remote teams.
We then include a binary variable of weak-tie collaboration in our regression model and confirm that although weak ties are associated with more disruptive discoveries, the negative relationship between distance and disruption remains intact (Extended Data Tables 1 and 2). Hence, even although remote teams have access to more diverse knowledge through weak ties, they fail to exchange, fuse and integrate that knowledge to generate disruptive ideas.
This last example suggests weak ties might produce more disruptive work, which conflicts with the main hypothesis of the paper, since remote work can produce more weak ties. The causal relationships between these variables are not mentioned in this paper, and no further attempt is made to better understand this seemingly contradictory result.
Time between publication and citation
The authors list one of their two major data sources as follows:
The first dataset includes data for scientific research teams responsible for 20,134,803 papers published by 22,566,650 scientists across 3,562 cities between 1960 and 2020.
There are a number of issues that arise when using citations as a proxy for scientific impact. For example, papers published in 1960 have had 60 years to acquire citations by the end of 2020, whereas papers published in 2020 would have had at most 12 months. How does this difference in time-to-acquire citations relate to \(D\)? The authors do not discuss this.
Similarly, the entire world was thrown into disarray in 2020 because of the onset of the COVID-19 pandemic. Papers that theoretically would have been published in 2020, but weren’t, aren’t able to cite papers from recent years, likely lessening their citation count. How does this disruption affect these trends over time?
More importantly, people started working from home more than ever before because of the pandemic. Fewer publications and more remote work thus happened together, both caused by the COVID-19 pandemic. This is a clear case where remote work did not cause the apparent decrease in disruption - the pandemic drove both. The authors mention the pandemic only in the Discussion, when describing the impact of their work, and not at all when explaining the data or methods.
Insufficient description and evaluation of their machine learning models
The authors use “machine learning techniques that infer team roles for papers for which author contributions are not explicit, thereby increasing our sample size”. The authors make serious omissions in their Methods section:
- no description of the type of neural network they use,
- no mention of the layer structure of the network,
- no description of how training and test data were separated,
- no evaluation of the model’s ability to infer team roles aside from “a precision of 0.79 and a recall of 0.793 in predicting author roles”,
- no evaluation of the “ground truth” team roles provided by the original dataset
These are all necessary components in evaluating whether a machine learning based model is accurate or useful. In particular, do the precision and recall of this model vary across different ranges of \(D\)? Since fewer than 1% of papers have \(\left| D \right| \ge 0.2\), it is possible that all the highly disruptive papers have misclassified author contributions. Initiatives like the Contributor Roles Taxonomy aim to clarify the roles of authors in published research, but these are recent efforts. The ability to correctly identify author contributions is likely confounded by time, too, since research groups tended to be smaller in the past and precise individual contributions were less emphasized.
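The kind of stratified evaluation I would want to see is straightforward to sketch. With toy labels and a toy classifier whose error rate worsens for high-\(\left| D \right|\) papers (everything here is simulated; nothing comes from the authors’ model), aggregate metrics hide the weaker performance on the disruptive tail:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
d = rng.uniform(-1, 1, size=n)        # toy disruption scores
truth = rng.binomial(1, 0.5, size=n)  # toy "true" author-role labels
# A toy classifier that errs 4x more often on high-|D| papers.
flip = rng.random(n) < np.where(np.abs(d) >= 0.2, 0.4, 0.1)
pred = np.where(flip, 1 - truth, truth)

def precision_recall(mask):
    tp = np.sum((pred == 1) & (truth == 1) & mask)
    prec = tp / max(np.sum((pred == 1) & mask), 1)
    rec = tp / max(np.sum((truth == 1) & mask), 1)
    return prec, rec

low = precision_recall(np.abs(d) < 0.2)    # the bulk of papers
high = precision_recall(np.abs(d) >= 0.2)  # the "disruptive" tail
print(low, high)  # pooled precision/recall would mask the gap between bins
```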
Using ordinary least squares on non-normal data, improper samples
In the Jupyter notebook published alongside the article, the authors have this snippet of Python code:
### regression between citation impact difference and probability of co-conceiving
import statsmodels.api as sm
df = pd.DataFrame()
y_1_all = []
x_1_all = []
for i in sorted(x1):
    if i <= 4:
        x_1_all += [i] * len(x1[i])
        y_1_all += x1[i]
# ... omitted for brevity
df['y'] = y_1_all
df['x'] = x_1_all
y = df['y']
x = df['x']
x = sm.add_constant(x)
#fit linear regression model
model = sm.OLS(y, x).fit()
#view model summary
print(model.summary())
Ordinary least squares regression assumes the response data comes from a distribution with normally distributed errors around the linear predictor. The code above is used in Fig 4b, which compares whether authors co-conceived ideas for a paper against the binned absolute difference between cumulative citation counts for each pair of authors on the paper. Is co-conception normally distributed around the binned citation difference? No - the response variable for this regression (whether the authors co-conceived the paper or not) is a binary outcome, so it cannot have normal residuals.
In this scenario, something like a logistic regression is often used to infer log-odds as a linear combination of the explanatory variables. We can actually do this with the processed data the authors provide. The code looks like this:
# fit a log-odds model
>>> logit_mod = sm.Logit(y, x)
>>> logit_res = logit_mod.fit()
>>> print(logit_res.summary())
If you run this, the results from the two cases the authors analyze in Fig 4b change a bit, but the interpretation stays aligned with the point the authors are trying to make.
However, spending too much time on the details of what test to use glosses over the bigger error for this section. The authors aren’t interested in how citation differences impact co-conception. They care about how on-site versus remote collaboration will impact the next generation of researchers. The citation difference is meant to be a proxy for early-/late-career researchers. From the article:
Building on the result that on-site teams involve more talent in conceiving research, we turn to explore how this affects the next generation of researchers, distinguishing between team members by their citation impacts.
The issue is whether working remotely versus on-site will impact early career researchers’ abilities to conceive ideas and take leading roles in paper production. The authors care about a given paper’s author contributions, so the papers are the unit of replication. When you look more closely at the data the authors use for the statistical test, here is what it looks like.
>>> df
y x
0 1 0.0
1 0 0.0
2 0 0.0
3 1 0.0
4 1 0.0
... .. ...
296856 0 4.0
296857 0 4.0
296858 0 4.0
296859 0 4.0
296860 0 4.0
There are 89 575 papers that the authors analyze for Figure 4, coming from 21 373 scientists who worked in both on-site and remote teams. But the data they use for statistical testing are all pairs of authors on each paper (155 842 pairs for on-site teams and 296 861 pairs for remote teams), and they treat each pair as an independent sample, even for multiple author pairs from the same paper. A single paper with 5 authors produces as many author pairs as 10 papers with 2 authors (10 pairs in total). A single paper with 20 authors produces as many author pairs as 19 papers with 5 authors (190 pairs in total). This is clearly an incorrect way to model author contributions and does not correspond to what the authors are suggesting in their paper.
Potential explanation of the observations
I believe the most likely source of the observed negative relationship between “disruption” and “remote work” is Berkson’s paradox: two causally independent variables can show a negative correlation in a sample because of selection on a common effect.
Consider a private school that selects students based on their academic and athletic prowess. The school contains academically gifted students who have no athletic ability, and athletically talented students who perform poorly in their studies. Some students will be both academically gifted and athletically talented. But there will be no student who is both academically poor and athletically poor.
If you consider all students in the school and correlate athletic ability with academic excellence, you’ll find a negative correlation between them. But this negative correlation isn’t because athletic ability and academic excellence exist in opposition to one another. It exists because of the school’s selection criteria, regardless of whether academics and athletics are causally related at all.
Here is an example with some meaningless, simulated data of 1000 students.

Filtering out low achievers induces a negative correlation between the two variables (middle panel), even though there is a near-zero correlation between them in the total population (left panel). Interestingly, the same effect appears if you instead filter out students who are high achievers in both athletics and academics (right panel). A negative correlation is induced in the observed population by the selection criteria, and the \(p\)-value is small due to the large sample size.
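A minimal version of that simulation (my own code, not the exact script behind the figure) reproduces the effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
academics = rng.normal(size=n)
athletics = rng.normal(size=n)  # generated independently of academics

# Full population: near-zero correlation by construction.
r_all = np.corrcoef(academics, athletics)[0, 1]

# Selection: the school admits anyone strong in at least one dimension.
admitted = (academics > 0.5) | (athletics > 0.5)
r_admitted = np.corrcoef(academics[admitted], athletics[admitted])[0, 1]

print(r_all, r_admitted)  # the admitted-only correlation comes out negative
```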
In this paper, the authors find a negative correlation between \(D\) and co-author distance. It is important to consider a few points:
- Most scientific research is done locally
- Only within the last few years has remote work become feasible for many scientists
- Most scientific research is not disruptive
- Recent publications have not had as much time as older papers to become recognized and cited
If the distance of research groups is independent of the disruptive impact of the work, it’s probable that few “disruptive” papers from remote teams exist simply because not enough time has passed to allow them to be published or recognized in sufficient numbers. This hypothesis contradicts the explanations given by the authors, yet is still consistent with the results within the paper. Here is a causal graph showing some of the relationships that connect disruption, \(D\), and other variables the authors mention (or should have mentioned) in the paper.

The relationship the authors comment on is “Co-Author Distance” \(\rightarrow\) “Disruption”. But “Disruption” is a nebulous concept and unobservable, so the authors use the proxy relationship “Co-Author Distance” \(\rightarrow D\). As you can see from the graph, there are back door paths connecting “Co-Author Distance” and \(D\) that do not contain colliders and do not pass through “Disruption”. For example: \(D \leftarrow n_i \leftarrow\) Publication Year \(\rightarrow\) Ability to Work Remotely \(\rightarrow\) Co-Author Distance.
- The relationship between “Co-author Distance” and “Disruption” is confounded by many factors that are not accounted for.
- The relationship between “Disruption” and \(D\) is confounded by factors that are not accounted for.
- The relationship between “Co-Author Distance” and \(D\) that is independent of “Disruption” is confounded by the ability to work remotely.
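To see how a back-door path like this can manufacture the headline result on its own, consider a toy simulation (entirely my own, with invented functional forms) in which distance affects nothing, yet publication year drives both citation accumulation and the feasibility of remote collaboration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
year = rng.uniform(1960, 2020, size=n)

# Older papers have had more time to accumulate "cite-only-this-work"
# citations (n_i); the rates here are invented for illustration.
n_i = rng.poisson(np.maximum(2020 - year, 1) * 0.2)
n_j = rng.poisson(3.0, size=n)
n_k = rng.poisson(3.0, size=n)
d = (n_i - n_j) / np.maximum(n_i + n_j + n_k, 1)

# Remote collaboration grows more feasible in later years; by construction
# it has no causal effect on d anywhere in this model.
distance = rng.exponential(scale=year - 1950)

print(np.corrcoef(distance, d)[0, 1])  # negative, purely via the year back door
```

Even with no causal arrow from distance to disruption, the simulated correlation is negative, which is exactly the pattern the paper reports.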
These confounding relationships are not discussed, but they are crucial to correctly interpreting the results. To perform a non-confounded analysis of the data they’ve gathered, the authors would have to adjust for most of the variables in this causal graph; they either do not do so, or adjust for only one variable at a time. If I were a reviewer for this paper, I would have rejected it on this basis alone. It almost doesn’t matter what the rest of the paper says, because failing to consider these relationships prevents the authors from uncovering anything of value in the first place. This is true regardless of whether their hypothesis is correct or not.
Conclusions
I can summarize the problems with this paper as follows:
- Confounded relationships between the explanatory and response variables that are not properly adjusted for,
- Ill-defined response variable that is insufficiently explored and whose quantitative value does not match the interpretation the authors suggest,
- Large amounts of data with NaNs that are not discussed,
- A large dataset with most of its variance contained in < 1% of the overall data,
- Insufficient analysis and discussion of raw data,
- Insufficient description of methods,
- Inappropriate statistical methods and interpretations of results,
- Results shown do not refute a hypothesis that contradicts the authors’ original hypothesis, and
- Overly strong conclusions that are not supported by the evidence provided.
For these reasons, I believe this paper should be retracted.
Impact of this paper
This paper, as written, has major potential implications for all workers. Executives, consultants, and managers can read the paper or news summaries and come away grossly misinformed about what this paper is and what it shows. This paper can easily be cited as a reason to force remote employees back into the office, on the strength of its title and Nature’s reputation alone.
Strong claims, like “remote work hinders disruption” or “we can’t have remote workers if we want to be disruptive” require strong evidence. However, this paper does not show that remote work is less disruptive. This paper doesn’t truly show anything because of its methodological flaws and mistakes in interpretation. Given the article is already published, it will likely take years to claw back the misinformation and misunderstanding this paper will generate, even if retracted immediately.
What this paper is, is a reasonable first attempt at understanding the relationships between remote work and innovation. I applaud the authors’ attempts to understand truly important and topical questions. I appreciate the data and code that is immediately available with the paper. I empathize with the authors’ difficulties working with big datasets and statistical models. I also understand the excitement that comes from having a potentially huge result and wanting to share it with the world.
But this paper is full of traps and mistakes that are present in all academic disciplines. Experienced academics should know better than to let those mistakes go unquestioned and should guide trainees to avoid these traps. Reviewers should be able to spot these issues immediately and prevent work like this from ever being published.
It is worth questioning what happened in the review process that allowed this paper to slip through. It is also worth reiterating that science journalists need to be better at questioning published papers, instead of just parroting them. If not, science journalists become misinformation superspreaders, themselves.
As someone who has done many terrible first pass analyses on big datasets, I understand exactly how and why this paper happened. I feel bad for the first author, who appears to be a PhD candidate. At such a delicate stage of your training as a scientist, every criticism of your work can feel like a direct criticism of you as a person, even when they are not. I do not believe you to be a bad scientist - I believe in your ability to learn from these mistakes and do better science in the future. But the severity and number of mistakes in this work are not acceptable as a peer-reviewed, published paper. And it is certainly not acceptable in a journal whose results will be accepted by many without criticism.
Mistakes like these are difficult, but necessary, learning opportunities for everyone involved. Science as an industry will not get better unless we learn from situations like these.
Additional notes
Code and figures used in this post can be found here in this git repository.
Updates
- 2024-01-23: Made language more precise about OLS, expanded on analysis used for Fig 4b.
- 2024-01-27: Minor update to the causal graph that removes the “total citations” variable, since it’s not relevant to the points discussed in this paper. Adding link to git repository.
- 2024-01-28: Contacted the corresponding authors via email and posted this summary of concerns on PubPeer.
- 2024-02-11: Sent another email to all 3 authors, after not receiving any response from the email 2 weeks ago.
Comments on Mastodon.