Biomedical NLP papers
Papers posted on @[email protected] (Clinical, Healthcare & Biomedical NLP)
Paper • 2412.20070 • Published • 39Note In this study, we attempted to employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Medical images can be precisely defined by Modality, Anatomical area, and Task, which naturally provides an environment for exploring CG. We assembled 106 medical datasets to create Med-MAT.
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes
Paper • 2412.19260 • Published • 1Note In this paper, we introduce the first publicly available benchmark for medical error detection and correction in clinical notes, covering 5 types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 82Note In this work, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. We use the verifier to guide the search for a complex reasoning trajectory for fine-tuning, and apply reinforcement learning with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning.
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
Paper • 2412.12661 • PublishedNote In this work, we present MedMax, the first large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including multimodal content generation (interleaved image-text data), biomedical image captioning and generation, visual chatting, and report understanding.
NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text
Paper • 2412.11477 • PublishedNote In this paper, we developed an approach for automated adjudication of medical notes based on i) models for ICD-10 diagnostic code sequences using a large real-world data set, ii) large language models for medical notes, and iii) contrastive pre-training to build an integrated model of both ICD-10 diagnostic codes and corresponding medical text.
Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data
Paper • 2412.00230 • Published • 1Note In this study, we investigate the various types of domain proxies used as substitutes for authentic clinical documents, including machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, as well as other types of common proxies such as journal publications.
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
Paper • 2411.15640 • Published • 4Note In this work, we introduce AfriMed-QA, the first large-scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, comprising 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias.
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Paper • 2411.06469 • Published • 17Note In this paper, we introduce a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models.
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?
Paper • 2411.04118 • Published • 1Note In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks.
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Paper • 2411.03590 • Published • 9Note In this work, we evaluate the o1-preview model across various medical benchmarks. Without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. However, we found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning models.
MedINST: Meta Dataset of Biomedical Instructions
Paper • 2410.13458 • Published • 6Note In this paper, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs' generalization ability.
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
Paper • 2410.10626 • Published • 38Note In this work, we propose a novel MoE routing method that employs language-specific experts and cross-lingual routing. Inspired by circuit theory, our routing analysis revealed a Spread Out in the End information flow mechanism: while earlier layers concentrate cross-lingual information flow, the later layers exhibit language-specific divergence.
CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures
Paper • 2410.05235 • Published • 2Note In this paper, we present a multilingual dataset for Medical QA where correct and incorrect diagnoses for a clinical case are enriched with a textual explanation written by doctors and manually annotated, resulting in a dataset of 558 clinical cases in four languages with 5k claims, 2k premises, 2k support relations, and 1k attack relations.
Named Clinical Entity Recognition Benchmark
Paper • 2410.05046 • Published • 17Note This technical report introduces a Named Clinical Entity Recognition Benchmark for evaluating language models in healthcare. The leaderboard provides a standardized platform for assessing diverse language models on their ability to identify and classify clinical entities across multiple medical domains. These entities are standardized according to the OMOP data model.
Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis
Paper • 2410.03908 • PublishedNote In this study, we introduce ANGST, a benchmark for depression-anxiety comorbidity classification from social media posts, comprising 2876 posts meticulously annotated by expert psychologists and 7667 silver-labeled posts. ANGST uses multi-label classification, allowing each post to be simultaneously identified as indicating depression and/or anxiety.
MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation
Paper • 2410.02458 • Published • 9Note This study explores enhancing Vision Transformers for medical image segmentation by integrating pre-trained LLM transformer blocks. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales.
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework
Paper • 2410.01553 • PublishedNote In this work, we introduce MedQA-CS, an AI-SCE framework inspired by medical education's Objective Structured Clinical Examinations. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios.
Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation
Paper • 2410.00163 • PublishedNote This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese, aiming to develop a reliable and relevant virtual assistant for healthcare professionals. The HealthCareMagic-100k-en and MedQuAD datasets, translated from English using GPT-3.5, were used to fine-tune the ChatBode-7B model using the PEFT-QLoRA method.
MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
Paper • 2409.19492 • PublishedNote In this work, we propose a carefully crafted first-of-its-kind medical hallucination dataset with a diverse range of health-related topics and the corresponding hallucinated responses from LLMs with labeled hallucination types and hallucinated text spans. We also introduce the MedHaluDetect framework to evaluate the capabilities of various LLMs in detecting hallucinations.
INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large Language Models and Ensemble Learning
Paper • 2409.19467 • PublishedNote In this work, we investigate state-of-the-art LLMs in text mining tasks on medications and their related attributes such as dosage, route, strength, and adverse effects. In addition, we explore different ensemble learning methods (Stack-Ensemble and Voting-Ensemble) to augment the model performances from individual LLMs.
Efficient and Personalized Mobile Health Event Prediction via Small Language Models
Paper • 2409.18987 • PublishedNote This paper examines the capability of SLMs to accurately analyze health data, such as steps, calories, sleep minutes, and other vital statistics, to assess an individual's health status. Our results indicate that SLMs could potentially be deployed on wearable or mobile devices for real-time health monitoring.
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?
Paper • 2409.15277 • Published • 35Note This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes.
Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs
Paper • 2409.14988 • Published • 22Note In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. Our evaluation across various clinical tasks reveals the impact of each technique.
MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder
Paper • 2409.14074 • PublishedNote In this work, we introduce MultiMed, a collection of small-to-large end-to-end ASR models for the medical domain, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese, together with the corresponding real-world ASR dataset. We also establish empirical baselines.
JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models
Paper • 2409.13317 • PublishedNote In this work, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark as well as the datasets are publicly available.
DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction
Paper • 2409.10504 • Published • 1Note In this work, we propose a mechanistic interpretability module that disentangles dense embeddings into a sparse embedding space, where nonzero elements represent globally learned medical concepts. Our LLM-based feature identification pipeline uncovered these concepts by summarizing the highest activating tokens by feature.
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Paper • 2409.07314 • Published • 50Note In this work, we introduce MEDIC, a framework assessing LLMs across 5 competences: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs.
Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data
Paper • 2408.13833 • PublishedNote In this study, we evaluated the performance of biomedical LLMs on clinical case challenges from biomedical journals and on several clinical tasks (e.g., information extraction, document summarization, and clinical coding). We found that biomedical LLMs mostly perform worse than their general-purpose counterparts, especially on tasks not focused on knowledge.
MultiMed: Massively Multimodal and Multitask Medical Understanding
Paper • 2408.12682 • PublishedNote In this work, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks.
Towards Evaluating and Building Versatile Large Language Models for Medicine
Paper • 2408.12547 • PublishedNote In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of LLMs in clinical contexts, spanning 11 clinical tasks. We also developed MedS-Ins, a large-scale instruction tuning dataset for medicine which comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions
Paper • 2408.08624 • PublishedNote In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions.
Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering
Paper • 2408.07888 • Published • 11Note In this study, we extend previous research by evaluating both curriculum-based and non-curriculum-based learning strategies across multiple LLMs, using human-defined and automated data labels for medical question answering. Our results indicate a moderate impact of using human-inspired learning strategies for fine-tuning LLMs.
Med42-v2: A Suite of Clinical LLMs
Paper • 2408.06142 • Published • 50Note Med42-v2 introduces a suite of clinical LLMs designed to address the limitations of generic models in healthcare settings. These models are built on Llama3 architecture and fine-tuned using specialized clinical data. They underwent multi-stage preference alignment to respond well to natural prompts, understand clinical queries, perform reasoning tasks, and provide valuable assistance in clinical environments.
Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation
Paper • 2408.04187 • Published • 3Note In this paper, we introduce a novel graph-based RAG framework for the medical domain. Entities extracted from carefully-chunked publications are used to create a 3-tier hierarchical graph structure, linking entities to foundational medical knowledge. They are then interconnected by similarity to form meta-graphs, used with the U-retrieve approach.
BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba
Paper • 2408.02600 • Published • 8Note In this paper, we present BioMamba, a pre-trained model specifically designed for biomedical text mining. BioMamba builds upon the Mamba architecture and is pre-trained on an extensive corpus of biomedical literature. Our empirical studies demonstrate that BioMamba significantly outperforms existing models like BioBERT across various biomedical tasks.
MedSyn: LLM-based Synthetic Medical Text Generation Framework
Paper • 2408.02056 • Published • 1Note In this study, we introduce MedSyn, a framework that combines large language models with a Medical Knowledge Graph (MKG) to generate synthetic medical text. We use MKG to sample prior medical information for prompts and generate synthetic clinical notes with GPT-4 and fine-tuned LLaMA models. We assess the effectiveness of synthetic data by applying it to the ICD code prediction task.
BioRAG: A RAG-LLM Framework for Biological Question Reasoning
Paper • 2408.01107 • PublishedNote In this paper, we introduce BioRAG, a novel RAG framework for biological question reasoning. We process 22 million scientific papers to build a comprehensive knowledge base and train a specialized embedding model. Additionally, we enhance vector retrieval with a domain-specific knowledge hierarchy and iterative retrieval for up-to-date information.
Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions
Paper • 2408.00727 • Published • 1Note In this article, we propose i-MedRAG, an iterative RAG system for medical question-answering. i-MedRAG enhances traditional RAG by allowing large language models to iteratively ask follow-up queries based on previous attempts. Our experiments demonstrate that i-MedRAG significantly improves performance on complex medical questions compared to vanilla RAG.
CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Paper • 2407.13301 • Published • 56Note This study introduces Chain-of-Diagnosis (CoD) to enhance the interpretability of LLM-based medical diagnostics. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician's thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making.
LLMs-in-the-loop Part-1: Expert Small AI Models for Bio-Medical Text Translation
Paper • 2407.12126 • Published • 52Note This study introduces a novel "LLMs-in-the-loop" approach to develop supervised neural machine translation models optimized specifically for medical texts. While LLMs have demonstrated powerful capabilities, this research shows that small, specialized models trained on high-quality in-domain (mostly synthetic) data can outperform even vastly larger LLMs.
Panacea: A foundation model for clinical trial search, summarization, design, and recruitment
Paper • 2407.11007 • PublishedNote In this work, we propose a clinical trial foundation model named Panacea, designed to handle multiple tasks, including trial search, trial summarization, trial design, and patient-trial matching. We also assemble a large-scale dataset, named TrialAlign, to infuse clinical knowledge during pre-training, and TrialInstruct, 200k instructions for fine-tuning.
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
Paper • 2407.08940 • PublishedNote In this study, we construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into a training set, a seen test, and unseen test set (based on publication date). Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
CLIMB: A Benchmark of Clinical Bias in Large Language Models
Paper • 2407.05250 • Published • 2Note This paper introduces CLIMB, a pioneering comprehensive benchmark to evaluate both intrinsic and extrinsic bias in LLMs for clinical decision tasks. Notably, for intrinsic bias, we introduce a novel metric, AssocMAD, to assess the disparities of LLMs across multiple demographic groups. We leverage counterfactual intervention to evaluate extrinsic bias.
How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Paper • 2407.05015 • Published • 4Note This paper introduces a biomedical RAG system designed to enhance the reliability of generated responses. The system is based on an LLM fine-tuned to provide references for each output statement, allowing users to verify the answer. Relevant abstracts retrieved from PubMed are passed to the LLM's context.
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis
Paper • 2407.04106 • PublishedNote This article introduces MiniGPT-Med, a vision-language model derived from large-scale language models and tailored for medical applications. The model supports numerous modalities (CT scans, MRIs, ...) and tasks (medical report generation, visual QA, and disease identification).
BioMNER: A Dataset for Biomedical Method Entity Recognition
Paper • 2406.20038 • PublishedNote In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge LLMs customised to our dataset.
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Paper • 2406.19280 • Published • 61Note In this work, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision.
MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation
Paper • 2406.17484 • PublishedNote This paper introduces MedCare, an LLM designed for medical tasks which employs a two-stage fine-tuning pipeline to separate clinical alignment from knowledge aggregation. The first stage uses a Knowledge Aggregator and a Noise Aggregator to encode and filter information, while the second leverages an alignment module to prevent knowledge forgetting.
RaTEScore: A Metric for Radiology Report Generation
Paper • 2406.16845 • Published • 4Note This paper introduces a novel, entity-aware metric, termed Radiological Report (Text) Evaluation (RaTEScore), to assess the quality of medical reports generated by AI models. RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings
Paper • 2406.16611 • PublishedNote In this survey, we evaluate 53 language models for the medical domain, focusing on classification and text generation tasks. The considered models are very diverse, ranging from 110M to 13B parameters, spanning the three families of Transformer-based models and from diverse knowledge domains.
EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records
Paper • 2406.16341 • Published • 12Note This paper presents a new dataset and task specifically designed to ensure data consistency between structured tables and unstructured notes in EHRs. EHRCon was crafted in collaboration with healthcare professionals using the MIMIC-III EHR dataset, and includes manual annotations of 3,943 entities across 105 clinical notes.
Real-time Speech Summarization for Medical Conversations
Paper • 2406.15888 • Published • 1Note In this work, we propose the first deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. We also present VietMed-Sum, a speech summarization dataset for medical conversations, and a baseline model trained on it.
Infusing clinical knowledge into tokenisers for language models
Paper • 2406.14312 • PublishedNote This study introduces K-Tokeniser, a technique that initializes the global representations of tokens based on the semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like UMLS or the training data of a task related corpus. At training or inference stage, the context is used to pick the best token representation.
Medical Spoken Named Entity Recognition
Paper • 2406.13337 • PublishedNote In this work, we present VietMed-NER, the first spoken NER dataset in the medical domain. To the best of our knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. We also present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence.
Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models
Paper • 2406.12182 • PublishedNote In this paper, we propose Aquila-Med, a bilingual medical LLM based on Aquila, trained through continued pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). We construct a large-scale Chinese and English medical dataset for continued pre-training and a high-quality SFT dataset, covering extensive medical specialties.
Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
Paper • 2406.12066 • Published • 8Note In this study, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks (MedQA and MedMCQA) before and after swapping brand and generic drug names, using physician expert annotations.
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams
Paper • 2406.11328 • PublishedNote In this paper, we introduce the Examinations for Medical Personnel in Chinese, a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists.
CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions
Paper • 2406.09923 • Published • 1Note In this work, we introduce a novel benchmark developed from the MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnoses from a diverse range of medical cases but also incorporates tasks of clinical significance.
Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing
Paper • 2406.06723 • PublishedNote In this article, we propose a novel approach for weak supervision in clinical natural language processing by leveraging fine-tuned LLMs to generate weakly-labeled data. We utilize this data to train a downstream BERT model, which is then further fine-tuned on a small set of gold standard data.
MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
Paper • 2406.06573 • Published • 9Note In this study, we introduce an adversarial method designed to test the robustness of LLMs in medical QA by modifying benchmark questions to confound the LLM. We target the MedQA benchmark's strong assumptions about patient characteristics and demonstrate successful "attacks" that trick the LLM into incorrect answers.
Towards a Personal Health Large Language Model
Paper • 2406.06474 • Published • 18Note In this paper, we introduce the Personal Health Large Language Model (PH-LLM), which is fine-tuned from Gemini to interpret and reason over numerical time-series data related to personal health. We developed three datasets to evaluate PH-LLM's capabilities in generating personalized insights from sleep patterns and physical activity.
MedExQA: Medical Question Answering Benchmark with Multiple Explanations
Paper • 2406.06331 • PublishedNote This paper introduces MedExQA, a novel benchmark in medical QA to evaluate LLMs' understanding of medical knowledge through explanations. By constructing datasets across five distinct, currently underrepresented medical specialties and by further incorporating multiple explanations for each question-answer pair, we address a major gap in current medical QA benchmarks.
MAIRA-2: Grounded Radiology Report Generation
Paper • 2406.04449 • PublishedNote In this work, we introduce a large multimodal model that combines a radiology-specific image encoder with an LLM for the task of grounded report generation on chest X-rays. The model utilizes comprehensive inputs including current and prior images and reports, as well as sections of the current report, to improve report quality and reduce hallucinations.
UltraMedical: Building Specialized Generalists in Biomedicine
Paper • 2406.03949 • PublishedNote In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks.
Enhancing Adverse Drug Event Detection with Multimodal Dataset: Corpus Creation and Model Development
Paper • 2405.15766 • PublishedNote In this work, we present a MultiModal Adverse Drug Event (MMADE) detection dataset, merging ADE-related textual information with visual aids. Additionally, we introduce a framework that leverages the capabilities of LLMs and VLMs for ADE detection by generating detailed descriptions of medical images depicting ADEs.
Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation
Paper • 2405.14905 • PublishedNote In this paper, we introduce a novel method, Structural Entities extraction and patient indications Incorporation (SEI) for chest X-ray report generation. Specifically, we employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports and improve the quality of factual entity sequences.
OLAPH: Improving Factuality in Biomedical Long-form Question Answering
Paper • 2405.12701 • Published • 1Note In this article, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We also propose OLAPH, a framework that enables the improvement of factuality through automatic evaluations by iteratively training LLMs to mitigate hallucinations, using sampling predictions and preference optimization.
COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain
Paper • 2405.10893 • PublishedNote In this technical paper, we outline the Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD), a novel benchmark for LLM evaluation in the medical domain. Specifically, we propose a scoring framework with increased difficulty to assess the ability of LLMs to interpret medical text, accompanied by MCQAs.
Adapting Abstract Meaning Representation Parsing to the Clinical Narrative -- the SPRING THYME parser
Paper • 2405.09153 • PublishedNote This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale.
MedConceptsQA -- Open Source Medical Concepts QA Benchmark
Paper • 2405.07348 • PublishedNote We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions on various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various LLMs.
Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
Paper • 2405.05506 • Published • 1Note In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora influence the outputs of LLMs.
Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents
Paper • 2405.02957 • Published • 1Note In this paper, we introduce a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. We show this knowledge is applicable to real-world benchmarks.
Aloe: A Family of Fine-tuned Open Healthcare LLMs
Paper • 2405.01886 • Published • 3Note In this work, we introduce the Aloe family, a set of open medical LLMs highly competitive within its scale range. Aloe models are trained on Mistral and LLaMA 3, using a new custom dataset which combines public data sources improved with synthetic CoT, with instruct tuning, model merging, alignment, red teaming and advanced inference schemes as improvement strategies.
Capabilities of Gemini Models in Medicine
Paper • 2404.18416 • Published • 23Note In this report, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art performance on 10 of them.
Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
Paper • 2404.16621 • PublishedNote We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. It offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs.
Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches
Paper • 2404.14779 • Published • 1Note This study presents a comprehensive analysis and comparison of full-parameter vs parameter-efficient tuning, within the context of medical LLMs. We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities.
emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information
Paper • 2404.12050 • PublishedNote In this work, we introduce emrQA-msquad, a medical dataset structured with the SQuAD V2.0 framework and enriched with emrQA medical information. It comprises 160k questions and 4k manually obtained answers, aimed at enhancing the accuracy of Medical QA systems. We also finetuned BERT-type models on the dataset.
MoE-TinyMed: Mixture of Experts for Tiny Medical Large Vision-Language Models
Paper • 2404.10237 • PublishedNote In this work, we developed MoE-TinyMed, a model tailored for medical applications that significantly lowers parameter demands. In evaluations on the VQA-RAD, SLAKE, and Path-VQA datasets, MoE-TinyMed outperformed LLaVA-Med in all Med-VQA closed settings with just 3.6B parameters.
Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain
Paper • 2404.07613 • PublishedNote In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain.
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering
Paper • 2404.05590 • PublishedNote In this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations written by doctors which can be leveraged to establish various gold-based upper-bounds for comparison with LLMs performance.
Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks
Paper • 2404.00376 • Published • 3Note We introduce Meerkat-7B, a novel medical AI system with 7 billion parameters. Meerkat-7B was trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets.
Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data
Paper • 2403.19031 • PublishedNote In this paper, we benchmarked various machine learning models, including classic SVMs, pretrained language models like RoBERTa, BERTweet, and SocBERT, and LLMs such as GPT-3.5 and GPT-4, across six text classification tasks using public social media data. We use LLMs either zero-shot, as annotator, or for data augmentation.
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
Paper • 2403.18421 • Published • 22Note In this article, we release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical QA results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam.
A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages
Paper • 2403.18336 • PublishedNote This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types.
Large Language Models in Biomedical and Health Informatics: A Bibliometric Review
Paper • 2403.16303 • Published • 1Note In this review, we conducted a bibliometric analysis of research articles and collaboration networks from 2022 to 2023 to understand the application of LLMs in Biomedical and Health Informatics. We mapped out key trends and major developments, highlighting how LLMs enhance NLP applications in medical diagnosis, patient engagement, and personalized medicine.
Large Language Model for Mental Health: A Systematic Review
Paper • 2403.15401 • PublishedNote In this review, we discuss the research methodology used in the paper. The methodology chapter explains the data collection and analysis methods, including the type of research conducted, data collection techniques, and any tools or materials used. It also justifies the methodological choices made, allowing readers to evaluate the reliability and validity of the research.
Polaris: A Safety-focused LLM Constellation Architecture for Healthcare
Paper • 2403.13313 • Published • 2Note We develop Polaris, the first safety-focused LLM constellation for real-time patient-AI healthcare conversations. Unlike prior LLM works in healthcare, our work specifically focuses on long multi-turn voice conversations. We train our models on proprietary data, clinical care plans, healthcare regulatory documents, medical manuals, and other medical reasoning documents.
Electrocardiogram Instruction Tuning for Report Generation
Paper • 2403.04945 • Published • 1Note We propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report.
Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People
Paper • 2403.03640 • Published • 2Note In this article, we describe both the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark, and the training of our Apollo models, state-of-the-art LLMs of various relatively small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B) which are capable of answering queries in the six most widely spoken languages worldwide.
To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering
Paper • 2403.01924 • Published • 4Note This paper presents MedGENIE, the first generate-then-read framework for multiple-choice question answering in medicine, which entails constructing artificial contexts through prompting instead of retrieving the context from PubMed. We conduct extensive experiments on MedQA-USMLE, MedMCQA, and MMLU.
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Paper • 2403.01528 • Published • 1Note In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. The study begins with an overview of biomolecular representations and delves into the integration of linguistic and molecular data, assessing its practical applications and resources.
KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
Paper • 2403.01469 • PublishedNote We introduce KorMedMCQA, the first Korean multiple-choice QA benchmark derived from Korean healthcare professional licensing examinations, covering from the year 2012 to year 2023. This dataset consists of a selection of questions from the license examinations for doctors, nurses, and pharmacists, featuring a diverse array of subjects.
MediSwift: Efficient Sparse Pre-trained Biomedical Language Models
Paper • 2403.00952 • PublishedNote In this work, we introduce MediSwift, a suite of efficient sparse pre-trained biomedical language models, obtained by inducing up to 75% weight sparsity during pre-training on biomedical text data. The models are further refined through dense fine-tuning and strategic soft prompting, achieving state-of-the-art results on several biomedical tasks.
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
Paper • 2402.18060 • Published • 1Note In this study, we construct two new datasets: JAMA Clinical Challenge and Medbullets. The first consists of questions based on challenging clinical cases, while the second comprises USMLE Step 2&3 style clinical questions. Both datasets are structured as multiple-choice QA tasks, where each question is accompanied by an expert-written explanation.
Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study
Paper • 2402.16689 • Published • 1Note In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We conducted evaluations of these models on 16 downstream tasks. Our findings reveal that further pre-training an English clinical model with French biomedical texts can outperform alternatives.
Towards Building Multilingual Language Model for Medicine
Paper • 2402.13963 • Published • 4Note In this paper, we aim to develop an open-source, multilingual language model for medicine that benefits a wider, linguistically diverse audience from different regions. We construct MMedC, a new multilingual medical corpus (25.5B tokens across 6 languages), and a new MCQA benchmark with rationale. We then finetuned several LLMs and evaluated them on the benchmark.
Benchmarking Retrieval-Augmented Generation for Medicine
Paper • 2402.13178 • Published • 5Note This work proposes the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets, and discovers a log-linear scaling property and the "lost-in-the-middle" effects in medical RAG.
Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks
Paper • 2402.10597 • Published • 2Note In this study, we compare different Parameter Efficient Fine-tuning (PEFT) methods for clinical natural language processing tasks, using various sizes of language models. We evaluate the performance of these methods on three clinical tasks: de-identification, assertion detection, and mortality prediction.
BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
Paper • 2402.10373 • Published • 9Note In this paper, we introduce BioMistral, an open-source LLM for the biomedical domain, utilizing Mistral as its foundation and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on 10 medical QA datasets in English. We also explore lightweight models obtained through quantization and model merging approaches.
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations
Paper • 2402.07023 • Published • 4Note In this work, we evaluate both open-source LLMs and Google’s new multimodal LLM, Gemini, across medical reasoning, hallucination detection, and medical visual question answering tasks. We also perform a detailed analysis by medical subject and test type. We release a Python module for medical LLM evaluation.
RareBench: Can LLMs Serve as Rare Diseases Specialists?
Paper • 2402.06341 • PublishedNote In this work, we introduce RareBench, a novel benchmark for assessing the performance of large language models (LLMs) on rare disease diagnosis and analysis. We also provide a rich dataset of rare disease cases, and a novel method to generate dynamic prompts using a rare disease knowledge graph. Our results show that our method improves LLMs’ diagnostic accuracy and interpretability.
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset
Paper • 2402.05547 • Published • 1Note We introduce ChatCoach, a system that helps medical students improve their communication skills with patients. It uses two AI agents: one that acts as a patient and one that acts as a coach. The student can talk to the patient agent and get feedback from the coach agent in real time. We compare the performance of ChatGPT and Llama2 for this task.
SA-MDKIF: A Scalable and Adaptable Medical Domain Knowledge Injection Framework for Large Language Models
Paper • 2402.00474 • PublishedNote In this study, we present SA-MDKIF, a framework that aims to inject medical knowledge into LLMs through instruction tuning, thereby enabling adaptability for various downstream tasks. SA-MDKIF consists of two stages: skill training and skill adaptation. We train a skill router to integrate the acquired skills with LLMs during inference.
Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks using Multiple Embedding Model for EHR (MEME)
Paper • 2402.00160 • PublishedNote In this work, we introduce MEME, an approach that views EHR as multimodal data. This approach incorporates "pseudo-notes", textual representations of tabular EHR concepts such as diagnoses and medications, and allows us to effectively employ LLMs for EHR representation.
Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings
Paper • 2401.15713 • Published • 2Note In this paper, we target this issue by assembling niche datasets using co-citations as a similarity metric, focusing on biomedical domains. We employ two key strategies: 1. Domain-specific Fine-Tuning, and 2. Universal Applicability with Mixture of Experts (MoE), adapting pretrained models with enforced routing for multiple domains simultaneously.
K-QA: A Real-World Medical Q&A Benchmark
Paper • 2401.14493 • PublishedNote We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health. We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models.
LongHealth: A Question Answering Benchmark with Long Clinical Documents
Paper • 2401.14490 • Published • 3Note We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5,090 to 6,754 words. The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting, requiring them to extract and interpret information from large clinical documents.
PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical Knowledge
Paper • 2401.11048 • Published • 2Note PubTator 3.0 is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases, and chemicals. It provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles.
Towards Conversational Diagnostic AI
Paper • 2401.05654 • Published • 16Note In this work, we introduce AMIE (Articulate Medical Intelligence Explorer), an LLM-based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts.
PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering
Paper • 2401.02797 • PublishedNote In this paper, we propose a parameter efficient framework for fine-tuning MLLM specifically tailored to Med-VQA applications, and empirically validate it on a public benchmark dataset. We outperform the GPT-4v model by a significant margin of 26% absolute accuracy on closed-ended questions, based on a human evaluation.
Generalist embedding models are better at short-context clinical semantic search than specialized embedding models
Paper • 2401.01943 • Published • 6Note This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, widely used in US hospitals and containing many clinical terms, and their easily reproducible rephrasing. We then benchmarked existing embedding models, either generalist or specialized in the clinical domain.
MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries
Paper • 2401.01596 • PublishedNote This work introduces the task of multimodal medical question summarization for codemixed input in a low-resource setting. To address this gap, we introduce the Multimodal Medical Codemixed Question Summarization (MMCQS) dataset, which combines Hindi-English codemixed medical queries with visual aids.
Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing
Paper • 2401.00579 • Published • 2Note Our study investigates the potential of instruction tuning for biomedical language processing, applying this technique to two general LLMs of substantial scale. We present a comprehensive, instruction-based model trained on a dataset that consists of approximately 200,000 instruction-focused samples.
Explanatory Argument Extraction of Correct Answers in Resident Medical Exams
Paper • 2312.00567 • PublishedNote We present a new dataset which (i) includes explanatory arguments for both correct and incorrect answers and (ii) is written by medical doctors to answer questions from the Spanish Residency Medical Exams. Furthermore, this new benchmark allows us to set up a novel extractive task which consists of identifying the explanation of the correct answer written by medical doctors.
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
Paper • 2311.16079 • Published • 20Note In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2, and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines.
BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights
Paper • 2311.16075 • Published • 6Note In this paper, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models and introduce BioLORD-2023, a state-of-the-art model for semantic textual similarity and biomedical concept representation designed for the clinical domain.
Overview of Current Applications of Large Language Models in Various Medical Specialities
Paper • 2311.12882 • Published • 1Note This paper gives an overview of the latest applications of Large Language Models (LLMs) in the healthcare sector, highlighting their transformative role in enhancing medical care quality. We explore their utilization in various medical specialties, such as cancer diagnostics, dentistry, nephrology, dermatology, etc.
KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model
Paper • 2311.11564 • Published • 1Note We propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach. We achieve a biomedical multilingual corpus by incorporating three granularity knowledge alignments (entity, fact, and passage levels) into monolingual corpora.
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
Paper • 2311.10537 • Published • 3Note We propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages LLM-based agents playing different roles and participating in a cooperative dialogue, which enhances their LLM competencies and reasoning skills. This framework is training-free and intuitive.
HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs
Paper • 2311.09774 • Published • 1Note We propose to transform heterogeneous data, from both the pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in the domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine.
Autoregressive Language Models For Estimating the Entropy of Epic EHR Audit Logs
Paper • 2311.06401 • Published • 1Note Existing techniques to measure the complexity of workflow through EHR audit logs involve time- or frequency-based cross-sectional aggregations that are unable to capture the full complexity of an EHR session. We evaluate the usage of transformer-based tabular LMs in measuring the entropy of action sequences within workflow and release the evaluated models publicly.
Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach
Paper • 2311.06364 • Published • 1Note We address the challenge of developing Relation Extraction models in biomedical areas, focusing on the sparsity of labeled data, particularly in the natural-products literature. We introduce a novel Greedy Maximum Entropy sampler to create a curated evaluation dataset and training sets using the LOTUS database.
ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences
Paper • 2311.06025 • Published • 1Note We propose ChiMed-GPT, a new benchmark LLM designed explicitly for the Chinese medical domain, with context length enlarged to 4,096 tokens, that undergoes a comprehensive training regime with pre-training, SFT, and RLHF and is evaluated on real-world tasks including information extraction, question answering, and dialogue generation.
BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing
Paper • 2310.19975 • Published • 1Note We created BioInstruct, comprising 25,005 instructions to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We then evaluated instruction-tuned LLMs on several BioNLP tasks.
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
Paper • 2310.14088 • Published • 1Note This study assesses the ability of state-of-the-art large language models (LLMs) including GPT-3.5, GPT-4, Falcon, and LLaMA 2 to identify patients with mild cognitive impairment (MCI) from discharge summaries and examines instances where the models' responses were misaligned with their reasoning.
Rather a Nurse than a Physician -- Contrastive Explanations under Investigation
Paper • 2310.11906 • Published • 1Note Contrastive explanations, where one decision is explained in contrast to another, are supposed to be closer to how humans explain decisions. We fine-tune and extract explanations from 3 chat models. A comparison between human and model rationales, both in contrastive and non-contrastive settings, shows that humans do not necessarily explain in a contrastive manner.
xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization
Paper • 2310.11275 • Published • 1Note We introduce xMEN, a modular system for cross-lingual medical entity normalization, which performs well in both low- and high-resource scenarios. When synonyms in the target language are scarce for a given terminology, we leverage English aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder model.
Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models
Paper • 2310.11266 • Published • 1Note We introduce BooksMed, a novel framework based on a Large Language Model (LLM) which uniquely emulates human cognitive processes to deliver evidence-based and reliable responses, utilizing the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework to effectively quantify evidence strength.
JMedLoRA: Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning
Paper • 2310.10083 • Published • 2Note We show the contribution of LoRA-based instruction-tuning to performance in Japanese medical question-answering tasks. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models demonstrating more pronounced effects.
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
Paper • 2310.07276 • Published • 5Note We propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature.
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Paper • 2310.05694 • Published • 3Note This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs.
AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
Paper • 2310.00274 • Published • 3Note We release AfriSpeech, 200 hours of Pan-African English speech, 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
MedEdit: Model Editing for Medical Question Answering with External Knowledge Bases
Paper • 2309.16035 • Published • 1Note Our study delves into model editing utilizing in-context learning, aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then we incorporate them into the query prompt for the LLM.
Large Language Models and Control Mechanisms Improve Text Readability of Biomedical Abstracts
Paper • 2309.13202 • Published • 1Note In this work, we investigate the ability of state-of-the-art large language models (LLMs) on the task of biomedical abstract simplification, using the publicly available dataset for plain language adaptation of biomedical abstracts (PLABA).
HealthFC: A Dataset of Health Claims for Evidence-Based Medical Fact-Checking
Paper • 2309.08503 • Published • 1Note We introduce a dataset of 750 health-related claims, labeled for veracity by medical experts and backed with evidence from appropriate clinical studies. The dataset can be used for tasks related to automated fact-checking such as evidence retrieval, veracity prediction, and explanation generation.
Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts
Paper • 2309.07430 • Published • 27Note In this work, we employ domain adaptation methods on eight LLMs, spanning six datasets and four distinct summarization tasks: radiology reports, patient questions, progress notes, and doctor-patient dialogue. Our thorough quantitative assessment reveals trade-offs between models and adaptation methods.
Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes
Paper • 2309.00237 • Published • 3Note In this article, we create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones.
BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge
Paper • 2308.16458 • Published • 10Note We present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates functions and methods in Python and Java from GitHub and the Rosalind Project.
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
Paper • 2308.14089 • Published • 29Note We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs.
CMB: A Comprehensive Medical Benchmark in Chinese
Paper • 2308.08833 • Published • 1Note We propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety.
BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition
Paper • 2308.08625 • Published • 2Note This paper aims to investigate different pre-training methods, such as pre-training the biomedical LM from scratch and pre-training it in a continued fashion. We also propose and evaluate initializing weights for new tokens by distilling existing weights from the BERT model inside the context where the tokens were found.
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
Paper • 2308.06354 • Published • 3Note This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text for improving the extraction of these scarcely documented, yet extremely valuable, clinical data.
Med-HALT: Medical Domain Hallucination Test for Large Language Models
Paper • 2307.15343 • Published • 2Note This research paper focuses on the challenges posed by hallucinations in LLMs, particularly in the context of the medical domain. We propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate and reduce hallucinations. Med-HALT includes two categories of tests: reasoning and memory-based hallucination tests.
Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data
Paper • 2307.14385 • Published • 2Note In this work, we present the first comprehensive evaluation of multiple LLMs, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4, on various mental health prediction tasks via online text data.
Towards Generalist Biomedical AI
Paper • 2307.14334 • Published • 12Note Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin.
Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section
Paper • 2307.07051 • Published • 1Note We propose a framework to analyze the sections with high predictive power. Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large.
Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events
Paper • 2307.06439 • Published • 9Note In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised learning, substantial gains can be attained over out-of-box LLMs.
EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models
Paper • 2307.02028 • Published • 3Note First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Third, we define 15 few-shot clinical prediction tasks.
BioCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
Paper • 2307.00589 • Published • 1Note We introduce BioCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot biomedical IR. To train BioCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker.
How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain
Paper • 2307.00186 • Published • 1Note This paper aims to provide a thorough comparison of the performance of LMs in medical few-shot NER, to answer the question "How far are LMs from 100% few-shot NER in the medical domain?", and moreover to explore an effective entity recognizer to help improve NER performance.
Biomedical Language Models are Robust to Sub-optimal Tokenization
Paper • 2306.17649 • Published • 1Note In this work, we first find that standard open-domain and biomedical tokenizers are largely unable to segment biomedical terms into meaningful components. But surprisingly, we find that pre-training a biomedical LM using a more accurate biomedical tokenizer does not improve the entity representation quality of a language model.
CamemBERT-bio: a Tasty French Language Model Better for your Health
Paper • 2306.15550 • Published • 3Note We propose a new French public biomedical dataset on which we have continued the pre-training of CamemBERT. Thus, we introduce a first version of CamemBERT-bio, a specialized public model for the French biomedical domain that shows 2.54 points of F1 score improvement on average on different biomedical named entity recognition tasks.
Radiology-GPT: A Large Language Model for Radiology
Paper • 2306.08666 • Published • 1Note We introduce Radiology-GPT, a large language model for radiology. Using an instruction tuning approach on an extensive dataset of radiology domain knowledge, Radiology-GPT demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA.
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Paper • 2306.08018 • Published • 4Note We introduce Mol-Instructions, a meticulously curated, comprehensive instruction dataset expressly designed for the biomolecular realm. Mol-Instructions is composed of three pivotal components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions.
Multilingual Clinical NER: Translation or Cross-lingual Transfer?
Paper • 2306.04384 • Published • 1Note This paper compares cross-lingual transfer with two translation-based alternatives for performing clinical NER in French and in German without any training data in those languages. To this end, we release MedNERF, a medical NER test set extracted from French drug prescriptions and annotated with the same guidelines as an English dataset.
ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation
Paper • 2306.02022 • Published • 1Note In this paper, we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performances of several common state-of-the-art approaches.
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Paper • 2306.00890 • Published • 10Note We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central and use GPT-4 to self-instruct instruction-following data from the captions.
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
Paper • 2305.17100 • Published • 2Note In this paper, we introduce a unified and generalist Biomedical Generative Pre-trained
Towards Expert-Level Medical Question Answering with Large Language Models
Paper • 2305.09617 • Published • 5Note We present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach.
Dr. LLaMA: Improving Small Language Models in Domain-Specific QA via Generative Data Augmentation
Paper • 2305.07804 • Published • 2Note In this paper, we introduce Dr. LLaMA, a method for improving SLMs through generative data augmentation using LLMs, focusing on medical question-answering tasks and the PubMedQA dataset. Our findings indicate that LLMs effectively refine and diversify existing question-answer pairs.
RadAdapt: Radiology Report Summarization via Lightweight Domain Adaptation of Large Language Models
Paper • 2305.01146 • Published • 1Note We systematically investigate lightweight strategies to adapt large language models (LLMs) for the task of radiology report summarization (RRS). Our results on the MIMIC-III dataset consistently demonstrate best performance by maximally adapting to the task via pretraining on clinical text and parameter-efficient fine-tuning on RRS examples.
A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
Paper • 2304.08999 • Published • 2Note In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. Since there was no annotated corpus for biomedical entity extraction in Portuguese prior to this work, we also present the strategy we followed in annotating the corpus for the development of the models.
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
Paper • 2304.00958 • Published • 1Note In this paper, we propose an original study of PLMs in the medical domain on French language. We also release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.
ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge
Paper • 2303.14070 • Published • 11Note We collected more than 700 diseases and their corresponding symptoms, recommended medications, and required medical tests, and then generated 5K doctor-patient conversations. Models finetuned on these emerge with great potential to understand patients' needs, provide informed advice, and offer valuable assistance in a variety of medical-related fields.
Capabilities of GPT-4 on Medical Challenge Problems
Paper • 2303.13375 • Published • 1Note We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a tuned version of Flan-PaLM 540B).
MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
Paper • 2303.08179 • Published • 2Note The model has been trained on a large corpus of 4.7 Million German medical documents and has been shown to achieve new state-of-the-art performance on eight different medical benchmarks covering a wide range of disciplines and medical document types. In addition to evaluating the model, this paper also conducts an in-depth analysis of its capabilities.
Almanac: Retrieval-Augmented Language Models for Clinical Medicine
Paper • 2303.01229 • Published • 1Note Large language models have a tendency to generate factually incorrect and sometimes even toxic statements. By enabling these models to access external point-of-care tools in response to physician queries, we demonstrate significantly improved factual grounding, helpfulness, and safety in a variety of clinical scenarios.
Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing
Paper • 2303.00915 • Published • 6Note In this paper, we conducted by far the largest study on biomedical VLP, using 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. BiomedCLIP established a new state of the art on a wide range of standard datasets and substantially outperformed prior VLP approaches.
Do We Still Need Clinical Language Models?
Paper • 2302.08091 • Published • 3Note We show that relatively small specialized clinical models substantially outperform all in-context learning approaches, even when finetuned on limited annotated data. Further, we find that pretraining on clinical tokens allows for smaller, more parameter-efficient models that either match or outperform much larger language models trained on general text.
EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records
Paper • 2301.07695 • Published • 1Note We present a new text-to-SQL dataset for electronic health records (EHRs). The utterances were collected from 222 hospital staff, including physicians, nurses, insurance review and health records teams, and more. Our dataset poses unique challenges: 1) generate SQL queries, 2) understand various time expressions, and 3) distinguish whether a given question is answerable.
Large Language Models Encode Clinical Knowledge
Paper • 2212.13138 • Published • 3Note We present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA.
Scientific and Creative Analogies in Pretrained Language Models
Paper • 2211.15268 • Published • 1Note This paper examines the encoding of analogy in large-scale pretrained language models. Existing analogy datasets typically focus on a limited set of analogical relations, with a high similarity of the two domains between which the analogy holds. On the other hand, SCAN contains systematic mappings of multiple attributes and relational structures across dissimilar domains.
RoentGen: Vision-Language Foundation Model for Chest X-ray Generation
Paper • 2211.12737 • Published • 2Note We fine-tuned a diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We present evidence that the resulting model is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled by using free-form text prompts including radiology-specific language.
A Large-Scale Dataset for Biomedical Keyphrase Generation
Paper • 2211.12124 • Published • 1Note We introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large scale datasets improves significantly the performances for present and absent keyphrase generation.
AF Adapter: Continual Pretraining for Building Chinese Biomedical Language Model
Paper • 2211.11363 • Published • 1Note Sequential task training may cause catastrophic forgetting, so we propose a continual pretraining method for the BERT-based model. Despite training only 3% of model parameters, our method could achieve better-than-SOTA performance (on Chinese biomedical tasks).
Galactica: A Large Language Model for Science
Paper • 2211.09085 • Published • 4Note In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. It sets a new state of the art on downstream tasks such as PubMedQA and MedMCQA dev, with 77.6% and 52.9% respectively.
BioLORD: Learning Ontological Representations from Definitions (for Biomedical Concepts and their Textual Descriptions)
Paper • 2210.11892 • Published • 2Note In this work, we propose a new method for learning vector representations of biomedical terms that are based on definitions and descriptions from a knowledge graph. Thanks to this grounding, our model produces concept representations that are more semantic than those of SapBERT and that match more closely the hierarchical structure of ontologies. The model also generalizes to clinical sentence similarity (STS).