Biomedical Data and Artificial Intelligence
By @WarrenWhitlock: “I asked #midjourney what a 2020 library might look like?”
Originally posted here
https://marinatalamanou.substack.com/p/biomedical-data-and-artificial-intelligence
From the Peer Review Process and the Reproducibility Crisis 🤥 to AI solutions for Data Mining ⛏️
The peer review process ✍🏻
Academic peer review took its first steps in 1665, in order to ensure that “the honour of X author’s invention will be inviolably preserved 🔏 to all posterity”. For that reason it was determined that “the Y article in the Society’s Science Transactions should be first reviewed by some of the members of the same (reviewers)”.
This system at the heart of all science has remained essentially unchanged since 1665 and nowadays it is the method by which:
papers are published 🪶 for dissemination of scientific knowledge,
grants 💱 are allocated,
academics are promoted 🎓👩🎓👨🎓 and
Nobel prizes 🏆 won.
But this process is now very stressed.
In 2005, the legendary Greek-American Stanford epidemiologist John Ioannidis wrote a paper, which has become the most widely cited paper ever published in the journal PLoS Medicine, examining how issues currently ingrained in the scientific publishing process indicate that at present 📢: “Most published findings are likely to be incorrect”.
In 2006, Richard Smith (a British medical doctor, editor, businessman and, among other things, chief executive of the BMJ Publishing Group for 13 years) said 📢: “[Peer review] is hard to define. It has until recently been unstudied. And its defects are easier to identify than its attributes. Yet it shows no sign of going away. Famously, it is compared with democracy: a system full of problems but the least worst we have.”
A decade after Ioannidis, Richard Horton, the UK-based medical writer and editor-in-chief of The Lancet, put it only slightly more mildly 📢: “Much of the scientific literature, perhaps half, may simply be untrue”.
Moreover, according to Richard Smith and Christopher Tancock (Editor-in-Chief of Elsevier) the peer review process is:
slow 🐌 and expensive 🫰,
inconsistent 🤷♀️,
with reviewers who sometimes turn out to be fake 🤥, overworked, under-prepared, inconsistent and rarely paid,
with agencies that “handle 🦮 the peer review process” for authors,
with journal shopping 🛒, a process where scientists submit first to the most prestigious journals in their field and then work down the hierarchy of impact factors,
with citation manipulations 🧞,
with ghostwriters 👻,
with flagrant conflicts of interest and power bias 🐊,
with fashionable trends of dubious 🧐 importance,
with publication bias 🧹🗑️, a process where negative results go unpublished, compounded by small sample sizes, tiny effects and invalid exploratory analyses, and
with an obsession for pursuing fashionable trends of dubious importance 🦖, which has allowed science to take a turn towards darkness 🌑.
As a result of all of the above, the replication (or reproducibility) crisis in the scientific publishing industry has emerged.
Replication or reproducibility crisis
In 2011, a group of researchers at Bayer looked at 67 recent early-stage drug discovery projects and found that in more than 75% of cases the published data did not match up ⛔ with their in-house attempts to replicate them. Keep in mind that these were blockbuster research studies featured in Science, Nature, Cell and the like (Source: "House of Cards: Is something wrong with the state of science?" by Harvard Edu).
In a paper published in 2012 ("Raise standards for preclinical cancer research"), one of the authors, Glenn Begley, a biotech consultant working at Amgen, said that during his decade of cancer research he tried to reproduce the results of 53 so-called landmark cancer studies, namely highly influential papers that have substantially changed the practice of medicine.
But after his team was unable to replicate 47 of these 53 studies, even after repeating the experiments in each study 50 times, he realised 🤔🤫 that in the original studies the authors had repeated the experiments only 6 times, found positive results only once and published only this positive result!!!
Surprisingly, under our current scientific publishing system most of the information about the boring negative results is just brushed 🧹 under the carpet. This has huge ramifications for replication, since researchers build theories on the back of landmark cancer studies (which they consider valid) and investigate the same idea using other methods.
Moreover, when they are led down the wrong “research path”, huge amounts of research money and effort are wasted 🚮 and the discovery of new medical treatments is seriously delayed. Consider also that every month there is some kind of news about replication issues in the scientific publishing industry (Source: "1,500 scientists lift the lid on reproducibility"), and you can get an idea of how serious this problem is.
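To see why publishing only the one positive run is so misleading, here is a minimal sketch of the arithmetic in Python. It assumes, purely for illustration, that there is no real effect at all, that each repeat is an independent test, and that the conventional 5% significance threshold is used. None of these figures come from the studies discussed above, and the function name is hypothetical.

```python
def chance_of_false_positive(n_repeats: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n_repeats independent experiments
    looks 'significant' when there is no real effect (assumed threshold alpha)."""
    return 1.0 - (1.0 - alpha) ** n_repeats


# Compare a single run with 6 and 50 repeats of the same experiment.
for n in (1, 6, 50):
    share = chance_of_false_positive(n)
    print(f"{n:>2} repeats: {share:.1%} chance of at least one 'positive' result")
```

Under these assumptions, six repeats already give roughly a one-in-four chance of a spurious positive, which is why a result that appears once in six attempts says very little when the other five go unreported.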
And it only gets worse.
A huge amount of early-stage research is presented only at conferences 🎤🗣️ (abstracts, posters and presentations), and it is estimated that only half of it appears in the academic literature. Studies presented only at conferences are almost impossible to find or cite, since very little information about them is available online.
When a systematic review carried out in 2010 looked at what happens to conference material by examining 30 separate studies, it found that in the vast majority of cases the unflattering negative results (presented only in conference presentations) are more likely to disappear 🪄 🫠 🫥 (Source) before the study becomes a fully-fledged academic paper.
Furthermore, specific academic literature can be ghost-managed, behind the scenes, to an undeclared agenda. In reality, some academic articles are written by a commercial writer (ghostwriter) employed by a pharmaceutical company, with an academic’s name placed at the top to give an imprimatur of independence and scientific rigour 🦚🦃. Often these academics have had little or no involvement in collecting the data or drafting the paper.
And here is where the problem only gets bigger.
Developing a new prescription medicine that gains marketing approval is estimated to cost drug makers $2.6 billion, with overall success rates of 5.1% for cancer drugs and 11.9% for all other drugs (from phase 1 to FDA approval 🆗). Furthermore, the entire process of drug development takes 10 to 15 years, and the number of new medicines approved per $1 billion spent on R&D has halved roughly every nine years since 1950 (Source).
Accordingly, if early-stage research is where novel hypotheses for future drug-biomarker candidates are formulated, but what comes out once that research goes through the “bottleneck of the peer review process” is a replication crisis, where are researchers supposed to get new lead generators for further drug-biomarker development⁉️⁉️
And because visuals just work 👉
The decline ⏬ in pharmaceutical R&D efficiency meets the data replication crisis: https://twitter.com/KevinUncensored/status/1305280821918539776
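The halving trend quoted above can also be written as a simple exponential decay. The sketch below only illustrates that arithmetic: the nine-year halving period and the 1950 starting point come from the text, while the function name and the choice to express efficiency relative to 1950 are assumptions made here.

```python
def relative_rd_efficiency(year: int, start_year: int = 1950,
                           halving_years: float = 9.0) -> float:
    """New medicines approved per $1 billion of R&D, relative to start_year,
    assuming a steady halving every halving_years."""
    return 0.5 ** ((year - start_year) / halving_years)


# Each step of nine years cuts the relative efficiency in half.
for year in (1950, 1959, 1986, 2013):
    print(year, f"{relative_rd_efficiency(year):.4f}")
```

If the trend held steadily, the seven halvings between 1950 and 2013 would leave R&D efficiency at 1/128 of its starting value.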
AI/ML Tools for BioMedical Data: What’s New? 🆕
We already have several AI/ML tools for the peer review process that can assist us in areas such as plagiarism prevention, requirements compliance checks and reviewer manuscript matching. For example,
Artificial Intelligence Review Assistant (AIRA): a platform that supports editors, reviewers and authors ✍️ in evaluating the quality of manuscripts and helps meet global demand for high-quality, objective peer review in publishing.
UNSILO: uses a corpus-based concept extraction tool to identify hundreds of concepts (key phrases that distinguish each article from all the others in the corpus) in a submission and ranks them in order of relevance to that paper 📜. It then matches the resulting cluster of concepts against 29 million articles and abstracts in the PubMed corpus.
Statcheck: an R package, also available as a web app, designed to detect statistical 📊 errors in peer-reviewed psychology articles by recomputing p-values from the reported test statistics.
Penelope.ai: an online tool that automatically checks 🕵️ whether scientific manuscripts meet journal requirements (such as references and the structure of a manuscript).
StatReviewer: an automated reviewer that checks scientific manuscripts for statistical errors 📊 and reporting integrity.
Peer Premier: a private company founded in 2021 that intends to effectively separate the peer review process from journals and their publishers. Peer Premier provides an independent, double-blind review using a standardised and comprehensive rubric for assessing manuscripts, making the review more quantifiable and accessible than current practices. Reviewers are selected through AI 📇 and the algorithm serves as a scholarly matchmaker, picking qualified reviewers for a manuscript regardless of their institution or background.
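The kind of consistency check that tools such as Statcheck automate can be illustrated with a minimal sketch. This is not Statcheck's code or API: it is a simplified Python stand-in that handles only a two-tailed z-test, and the function names, the tolerance and the example numbers are all assumptions made for illustration.

```python
import math


def two_tailed_p_from_z(z: float) -> float:
    """Two-tailed p-value for a standard normal (z) test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_consistent(z: float, reported_p: float, tolerance: float = 0.005) -> bool:
    """True if the reported p-value agrees with the recomputed one
    within a hypothetical rounding tolerance."""
    return abs(two_tailed_p_from_z(z) - reported_p) <= tolerance


# Made-up reported results, not taken from any real paper.
print(is_consistent(1.96, 0.05))  # the reported p matches the statistic
print(is_consistent(1.20, 0.01))  # the reported p is far too small for z = 1.20
```

A real checker also has to extract the reported statistics from the manuscript text and handle other tests (t, F, chi-square and so on), which this sketch leaves out.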
However, the opposite is also true, since some AI/ML tools can make researchers’ lives even harder during the already very stressed peer review process. For example, AI models like DALL-E, Stable Diffusion and Midjourney already produce realistic pictures of human faces, objects and scenes, and it is only a matter of time before they start creating convincing scientific images too, as pointed out in the article “Thanks to generative AI, catching fraud science is going to be this much harder” by The Register. In particular, the author Katyanna Quach highlighted how image analysts scrutinising data in scientific papers came across a strange set of images that appeared in 17 biochemistry-related studies and shared the same background. The analysts concluded that the suspicious-looking images of western blots under investigation 🧐 were most likely computer-generated and produced as part of a paper mill operation: “an effort to mass produce bio-chemical papers using faked data, and get them peer reviewed and published”.
Of course, data mining and knowledge extraction of biomedical data are not only about Peer-Reviewed Literature and relevant omics and imaging data, but also about Surveys, Medical Records (CAT and MRI scans, signals from EEG, laboratory data from blood, specimen analysis and clinical data from patients), Claims Data, Vital Records, Surveillance, Quizzes and Unpublished Proprietary Lab Data (observations and lab notes), making a biomedical data armageddon 🌪️ a real possibility.
Let’s now look at the AI companies dealing with this armageddon.
📉 In 2021, LatchBio emerged from stealth mode providing almost code-free biocomputing solutions on the cloud ☁️ that can be accessed from anywhere via a browser to simplify biological data analysis. Using their platform, researchers can upload files and access dozens of bioinformatics pipelines and data visualisation tools from analysing RNA sequencing data to designing CRISPR edits and even running the AlphaFold software just from their laptop. LatchBio has raised a total of $33.2M in funding over 4 rounds.
📉 Euretos uses natural language processing to interpret research papers (2 to 2.5 million new scientific papers are published each year in about 28,100 active scholarly peer-reviewed journals), but this is secondary to the 200-plus biomedical-data repositories it integrates. In particular, they provide biological knowledge graphs 🕸️ that semantically harmonise public and proprietary data, literature and patents, and they customise these to create client-specific knowledge graphs with domain-specific and/or proprietary data. Their ML models are driven by multi-omics data, which minimises publication bias, and integrate predictions from different types of multi-omics networks to provide biological insight. Two months ago a study was published using @Euretos’ data-driven selection software to evaluate cell surface biomarkers as potential targets for fluorescence-guided surgery in non-small cell lung cancers.
📉 BioSymetrics is a phenomics-driven drug discovery company that integrates clinical and experimental data using ML, creating a phenograph to navigate from phenotypes to genes and to druggable targets. In October 2022, BioSymetrics and Deerfield Management, a healthcare investment firm, announced a five-year joint venture to accelerate the advancement of new therapeutics, with an initial focus on cardiovascular and neurological diseases. BioSymetrics has raised a total funding of $4.87M over 1 round.
📉 Datavant, which specialises in breaking down silos and analysing health data securely and privately, acquired Swellbox just before this Christmas to enable patients 🤒 to request their medical records seamlessly. Swellbox also enables patient authorisation for record retrieval for clinical trial recruitment, long-term surveillance, registry creation and other use cases. On January 17, 2023, Socially Determined, the social risk analytics and data company that is empowering health care organisations to manage risk, improve outcomes and advance equity at scale, announced a partnership with Datavant that will enable Socially Determined to provide curated, de-identified and linkable social risk data at the patient level. Datavant has raised a total of $80.5M in funding over 2 rounds.
📉 Genialis is developing next-generation patient classifiers using ML and high-throughput omics data. Its tool, the Genialis ResponderID, is a biomarker discovery platform capable of developing biomarkers for investigational drugs by modelling disease biology rather than relying on treatment response. They also offer a second tool, the Genialis Expressions software, that enables ML-driven biomarker discovery by aggregating consistently analysed and annotated data. The Genialis Expressions software is built on FAIR (findability, accessibility, interoperability and reusability) data management principles, in order to analyse sequencing data across numerous NGS platforms. Genialis has raised a total of $13.9M in funding over 4 rounds. Their latest funding was raised on Jan 31, 2023 from a Venture - Series Unknown round.
📉 Owkin develops ML to connect medical researchers with high-quality datasets from leading academic research centres around the world and applies AI to research cohorts and scientific questions. By implementing a causal approach to AI, Owkin is able to simultaneously discover new treatments while identifying new subgroups of patients who would most benefit from them. On January 19, 2023, Nature Medicine published breakthrough Owkin research on the first ever use of federated learning to train deep learning models on multiple hospitals’ histopathology data. Owkin has raised a total of $304.1M in funding over 8 rounds.
📉 PatSnap is the leading Connected Innovation Intelligence platform. In 2022, PatSnap launched Eureka, an AI-powered innovation solutions platform, designed to make intellectual property (IP) accessible for R&D professionals, by translating the legal ⚖️ language of IP into the technical language of R&D 🧪🔬. PatSnap has raised a total of $351.6M in funding over 6 rounds. Their latest funding was raised on Mar 16, 2021 from a Series E round.
📉 Nference, Inc is a science-first software company that partners with medical centres to turn decades of rich and predominantly unstructured data captured in electronic medical records into powerful software solutions that enable scientists to discover and develop the next generation of personalised diagnostics and treatments. Nference just made it onto FastCompany’s list of the 10 🔝 most innovative companies in data science of 2023! Nference has raised a total of $152.7M in funding over 7 rounds.
📉 Snowflake’s Healthcare & Life Sciences Data Cloud allows companies to eliminate data marts, break down silos 🧱, capitalise on near-unlimited performance and create a single source of truth by bringing diverse data together and granting governed access for all users and applications. Snowflake has raised a total of $2B in funding over 10 rounds. Their latest funding was raised on Apr 19, 2022 from a Post-IPO Equity round.
📉 StoneWise in China is using AI to identify and process massive structured data. So far, they have developed an ultra-high-throughput molecule screening system that allows virtual screening of billions of molecules and is combined with a multi-dimensional molecular generation and optimisation system to enable faster and better drug discovery. StoneWise has raised a total funding of $100M over 4 rounds.
📉 Databricks is offering Lakehouse for Healthcare and Life Sciences. In particular, they offer a single platform that brings together all data (structured and unstructured data — patient, R&D and operations) and analytics workloads (with applications ranging from managing hospital bed capacity to optimising the manufacturing and distribution of pharmaceuticals) to enable transformative innovations in patient care and drug R&D. On December 27, 2022, Quantori, LLC, a leading global provider of data science and digital transformation solutions for life science and healthcare organisations, announced a partnership with Databricks to power innovation across the entire drug lifecycle by unifying data, analytics, and AI on a simple and open multi-cloud platform. Databricks has raised a total of $3.5B in funding over 9 rounds. Their latest funding was raised on Aug 31, 2021 from a Series H round.
📉 Kyndi is a global natural language processing company whose AI-powered platform is being deployed in areas such as supply chain management, manufacturing, healthcare, medical research and financial services. On March 21, 2023, the Kyndi Natural Language Platform was named a 2023 CUSTOMER magazine Product of the Year Award winner 🏆 by the global integrated media company TMC. Kyndi has raised a total of $42.8M in funding over 8 rounds.
📉 Veeva Systems is a global leader in cloud software for the life sciences. On March 28, 2023, Veeva announced that more than 100 life sciences companies are using Veeva CRM Events Management to plan and execute in-person, virtual, and hybrid events worldwide. Veeva has raised a total of $7M in funding over 2 rounds.
📉 OneThreeBiotech utilises AI to integrate and analyse over 30 types of chemical, biological and clinical data, allowing researchers to generate new insights during drug development. The company is collaborating with PoolbegPharma, which is applying OneThreeBiotech’s ATLANTIS platform to identify novel drug targets and signatures driving respiratory syncytial virus infection. OneThreeBiotech has raised a total funding of $2.8M over 1 round.
Until next time,