One of the promises, arguably the main promise, of Google in the early years was that all the world's information was now at your fingertips. It was a huge step change at the time. And though the search engine has continued to change (in recent years, for the worse), another product they created way back when has changed a lot less: Google Scholar.
Google Scholar works well for what it is: I can throw in a few keywords, maybe filter by year, and find a bunch of articles that are in the general vicinity of what I'm looking for. But we have come so far since the days of PageRank and keyword search plus a bit of semantic dusting on top. There's a whole new crop of LLM and AI-infused search engines for academic articles. What's out there? And does it tell us anything about where knowledge retrieval is heading?
Surveying the landscape
I reached out to various academic communities I'm part of and did some searching of my own, ending up with a list of a few dozen search engines–some I'd tried, some I hadn't. For this article I'm not interested in variations on Google Scholar or plain indexes: RefSeek, Science.gov, and BASE all have their uses, but they don't appear to be leveraging modern language models. There were also a few domain-specific engines like Luva and Emergent Mind, but I want all of science, or as close as is possible given the current realities of journal licensing. Then I eliminated projects that mostly ignore the paper itself in favor of its metadata (Research Rabbit, Litmaps, and Connected Papers), neat tech demos (HasAnyone), and a few platforms that seemed to be variations on article retrieval and RAG, but worse. It's never a good sign when my first search on your platform yields only irrelevant results, and an LLM is then forced to write a paragraph regretfully explaining how each result fails to address my query. It feels like a waste of time for me and, frankly, for the model.
Eventually I whittled the list down to this:
- Exa: an SF-based VC-funded startup that looks like it wants to build a better version of Google
- Elicit: another new VC-funded Bay Area startup that is focused on academic search, as well as consulting services
- Consensus: a Boston-based startup founded by people from the sports world
- Scite: a Brooklyn startup acquired by Research Solutions last year as part of that company's pivot into AI
- Semantic Scholar: a tool created by the Allen Institute for Artificial Intelligence, a non-profit
And as my baseline comparisons:
- Claude and/or Perplexity
I unfortunately wasn't able to have one of my baselines be "go to an academic library and ask a librarian," but that's definitely a viable research pathway for academics and something these startups know they're competing with.
I tried a range of questions on them, from "What's the ellipticity of Earth?" to "What are the measurable cognitive effects of caffeine?" to fishing for my own published papers from back in the day with searches like "What are the effects of embargoes in simulated agent-based markets?" Eventually three major questions emerged:
1. How much generated text is the right amount for a research assistant?
2. Are these searches significantly more precise than Google Scholar et al.?
3. Do any of these clearly beat Google Scholar plus chatting with an LLM, as a product with legs?
A few facepalm moments to eliminate a few more
Consensus gives me consistently bad search results, even with their paid Pro plan. I honestly couldn't run a test on this platform without being frustrated by something: either bafflingly irrelevant search results or overconfident generated text undermined at least half of my tests. I just don't understand how it can be this bad: how does a search for papers on agent-based models lead to research on nursing school training? How does a query about the environmental impacts of LLMs bring up papers about lean manufacturing? And these were the first papers in the list of results.
I also wasn't a fan of Epsilon AI. It didn't seem to be much more than a RAG over all of science, and there are better versions of that. It's also slow, taking tens of seconds to show me retrieved papers. I get that generated text is expensive, but making me wait over a minute to see the output that is the entire point of your product kills it for me.
How much generated text is ideal, if any?
Sometimes a language model will give you exactly the answer you need. Claude is better than any of the services I tried at answering questions like "What's the ellipticity of Earth?" or "Who in Seattle built a video game where you make robots out of DNA?" (The answer to the first is 0.0033, and the answer to the second is my team at a University of Washington lab in 2013.) But language models hallucinate, so I can never fully trust generated text. One model of a research platform, then, is to give language models papers to reference while they write a short essay answering my question. The platforms that take this tack are Elicit, Epsilon, and Consensus.
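To make that concrete, here's a minimal sketch of the essay-with-references flow, using the Anthropic SDK since Claude was my baseline. The paper list, prompt, and model alias here are illustrative assumptions, not how Elicit, Epsilon, or Consensus actually build their answers:

```python
# A minimal sketch of the essay-with-references model: take a handful of
# retrieved papers and ask an LLM to write a short, cited answer from them.
# The `papers` list is hypothetical stand-in data; a real platform would
# pull it from its own index first.
import anthropic

papers = [
    {"title": "Caffeine and sustained attention: a review", "abstract": "..."},
    {"title": "Dose-dependent effects of caffeine on working memory", "abstract": "..."},
]

context = "\n\n".join(
    f"[{i + 1}] {p['title']}\n{p['abstract']}" for i, p in enumerate(papers)
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # alias current at the time of writing
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": (
            "Answer the question using only the papers below, citing them "
            f"by [number].\n\n{context}\n\n"
            "Question: What are the measurable cognitive effects of caffeine?"
        ),
    }],
)
print(message.content[0].text)
```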
The other model, and the one I ended up preferring, is to use generated text as either an on-ramp into the literature or a way to interrogate results after you've already found them. These search engines prioritize getting the most relevant passages of human-written text in front of you fast, possibly with some intermediate generated summaries along the way. The platforms I tried in this category were Scite, Exa, and Semantic Scholar.
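The retrieval half of that pattern looks roughly like this. It's a guess at the general shape, using an off-the-shelf sentence-transformers model and made-up passages, not a description of how Scite, Exa, or Semantic Scholar actually rank results:

```python
# A sketch of embeddings-based passage retrieval: embed the query and the
# candidate passages in the same vector space, then rank by similarity and
# show the best human-written text first.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

passages = [
    "Caffeine improved sustained attention in a double-blind trial.",
    "We simulate agent-based markets under varying embargo conditions.",
    "Lean manufacturing reduced waste in a mid-sized assembly plant.",
]
query = "What are the measurable cognitive effects of caffeine?"

# With normalized embeddings, the dot product equals cosine similarity.
passage_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

scores = passage_vecs @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {passages[idx]}")
```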
I have two main problems with the services that heavily emphasize chatting with an LLM: 1) I just can't fully trust what they're telling me, and 2) an essay is usually not my preferred user interface. The more a service emphasizes being in dialogue with its chatbot, the more it feels like homework to actually check its sources, and today's language models just aren't at the point where you can take everything they summarize from the retrieved papers as gospel. The essays are sometimes nice when I don't know much about the domain I'm poking around in, but they're less information-dense than the actual paper results, so I usually want to see those pretty quickly.
Exa's product decision here is simple: put the chatbot behind a button push, after the user gets their results. You can throw a few papers into a single chat, which makes sense: it lets you choose the papers that are actually relevant and opt in to the slower experience of generated text.
Scite also doesn't return any generated text by default, instead giving you a list of passages from the papers that should be most relevant to your query, along with some context and metadata about those passages' origins. They have a chatbot that relies on the "smart citation" result set, but, puzzlingly, you have to perform the search all over again if you want to switch from search mode to assistant mode.
That said, there are definitely times when a generated essay is better than search results, like when your question itself is wrong. When I searched for "What's the eccentricity of the Earth?", it took Claude to point out that it was giving me the value for the eccentricity of Earth's orbit–the term for the deformation of a sphere is ellipticity.
Are these platforms significantly more precise?
It depends on what kind of search you're doing. Sometimes Google Scholar was actually better than Elicit or Scite or Semantic Scholar; sometimes even Consensus returned decent results. Precision probably also depends on the subset of scientific papers each service indexes–I know Consensus, Elicit, and Epsilon all pull from Semantic Scholar, but I couldn't find substantive information on where Exa or Scite get their papers. Sometimes Semantic Scholar's own results are great, and sometimes it returns zero results for queries other platforms handle just fine.
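Semantic Scholar's corpus is at least openly queryable: its Academic Graph API has a paper-search endpoint that, at the time of writing, works without an API key for light use. A quick sketch, with one of my test questions as the query:

```python
# Query the Semantic Scholar Academic Graph paper-search endpoint.
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "effects of embargoes in simulated agent-based markets",
        "fields": "title,year,abstract",
        "limit": 5,
    },
    timeout=30,
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    print(paper.get("year"), paper["title"])
```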
Overall, from the week of testing I gave all the platforms, they're definitely better than Google Scholar et al.–just not miles ahead.
Do any clearly beat Google Scholar?
Going through this whole process has kind of reinforced for me how great Google Scholar is. There are some compelling features in these new search engines–Scite's "smart citations" and Elicit's tabulated generation are nice–but Google Scholar is free and, possibly more importantly, has been stable for 20 years. If I were using these platforms every day I'm sure I'd want lots more power-user features, but yeah, I'm just gonna say it: Google Scholar is still holding its own for now.
If Scite's search were as good as Exa's (or even Google Scholar's, depending on your query), I would probably use it frequently, maybe even daily. But because each search result takes up more space, each missed result's effect is magnified, and I've hit enough misses that I can't enthusiastically recommend the service in its current form. The smart-citation format is just not (yet) the killer feature it could be.
Parting thoughts
I'm not an academic; I'm a humble hacker who has spent years as a research engineer and at startups spinning research out into products. So my interest in these tools skews heavily toward the power-user side, and I can't speak much to what someone who writes papers as part of their job is going to need.
There's also a huge caveat to all these platforms, especially those that are young startups. Change is very likely–not only might we see more step changes in model capabilities, but we'll almost certainly see business models change as startups try to pivot into profitability. Plenty of them will be acquired by larger companies, which might mean an end to the product as it exists–and some of the search engines feel less like viable products and more like advertisements for the team.
Maybe one of these platforms will figure out how to make the generated essay format work well, but embeddings-based search is just better right now as the main search modality. Keep the chatbot off to the side so the user can opt-in when they want it. As for what this all says about the future of knowledge retrieval? All the pieces of a great pro search experience are there. Someone just needs to put them together and make it a sustainable business.