Mixture models for undiagnosed prevalent disease and interval-censored incident disease: applications to a cohort assembled from electronic health records

Li C. Cheung; Qing Pan; Noorie Hyun; Mark Schiffman; Barbara Fetterman; Philip E. Castle; Thomas Lorey; Hormuzd A. Katki

doi:10.1002/sim.7380

Epidemiology & Population Health

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan–Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence–incidence models. Parameters for parametric prevalence–incidence models, such as the logistic regression and Weibull survival (logistic–Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan–Meier, logistic–Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan–Meier provided poor estimates, while the logistic–Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic–Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.
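As a rough illustration of the model structure the abstract describes, the Python sketch below combines a point mass at time zero (prevalent disease) with a Weibull distribution for incident disease and evaluates an interval-censored log-likelihood. It is not the authors' implementation: the function names, the intercept-only logistic model, the shape/scale Weibull parametrization, and the convention of coding "baseline status unknown" as a negative left endpoint are all assumptions made for this example, and the parameter values are purely illustrative.

```python
import math


def prevalence(beta0):
    """Intercept-only logistic model for the probability of prevalent disease.

    Assumption: a real analysis would use covariates; this sketch does not.
    """
    return 1.0 / (1.0 + math.exp(-beta0))


def cumulative_risk(t, p, shape, scale):
    """Mixture cumulative risk: point mass p at time zero plus Weibull incidence.

    Returns 0 for t < 0 so that a negative left endpoint can encode
    "not known to be disease-free at time zero" (a convention of this sketch).
    """
    if t < 0:
        return 0.0
    if math.isinf(t):
        return 1.0
    weibull_cdf = 1.0 - math.exp(-((t / scale) ** shape))
    return p + (1.0 - p) * weibull_cdf


def log_likelihood(intervals, p, shape, scale):
    """Interval-censored log-likelihood; onset is assumed to lie in (left, right].

    right = math.inf marks a right-censored subject.
    """
    total = 0.0
    for left, right in intervals:
        prob = (cumulative_risk(right, p, shape, scale)
                - cumulative_risk(left, p, shape, scale))
        total += math.log(prob)
    return total


if __name__ == "__main__":
    # Hypothetical data: (left, right] intervals in years.
    data = [
        (-1.0, 0.0),        # disease found at the baseline visit
        (-1.0, 2.0),        # baseline status unknown, diagnosed at year 2
        (0.0, 3.0),         # disease-free at baseline, diagnosed by year 3
        (1.0, math.inf),    # disease-free at year 1, then right-censored
    ]
    p = prevalence(-2.0)    # illustrative intercept, not an estimate
    print(cumulative_risk(5.0, p, shape=1.2, scale=40.0))
    print(log_likelihood(data, p, shape=1.2, scale=40.0))
```

Under these assumptions, the cumulative risk at time zero equals the prevalence probability rather than zero, which is the feature that distinguishes this mixture from a standard survival curve. Maximizing `log_likelihood` over the parameters with a numerical optimizer would correspond to the direct likelihood maximization mentioned in the abstract.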

Original language	English (US)
Pages (from-to)	3583-3595
Number of pages	13
Journal	Statistics in Medicine
Volume	36
Issue number	22
DOIs	https://doi.org/10.1002/sim.7380
State	Published - Sep 30 2017

Keywords

HPV
Kaplan–Meier
cervical cancer
cumulative risk estimation
prevalence–incidence models

ASJC Scopus subject areas

Epidemiology
Statistics and Probability

Cite this

@article{14f7c18ceada49258677b5dd4a59f294,
  title = "Mixture models for undiagnosed prevalent disease and interval-censored incident disease: applications to a cohort assembled from electronic health records",
  abstract = "For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan–Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence–incidence models. Parameters for parametric prevalence–incidence models, such as the logistic regression and Weibull survival (logistic–Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan–Meier, logistic–Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan–Meier provided poor estimates, while the logistic–Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic–Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.",
  keywords = "HPV, Kaplan–Meier, cervical cancer, cumulative risk estimation, prevalence–incidence models",
  author = "Cheung, Li C. and Pan, Qing and Hyun, Noorie and Schiffman, Mark and Fetterman, Barbara and Castle, Philip E. and Lorey, Thomas and Katki, Hormuzd A.",
  note = "Publisher Copyright: Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.",
  year = "2017",
  month = sep,
  day = "30",
  doi = "10.1002/sim.7380",
  language = "English (US)",
  volume = "36",
  number = "22",
  pages = "3583--3595",
  journal = "Statistics in Medicine",
  issn = "0277-6715",
  publisher = "John Wiley and Sons Ltd",
}

TY  - JOUR
T1  - Mixture models for undiagnosed prevalent disease and interval-censored incident disease
T2  - applications to a cohort assembled from electronic health records
AU  - Cheung, Li C.
AU  - Pan, Qing
AU  - Hyun, Noorie
AU  - Schiffman, Mark
AU  - Fetterman, Barbara
AU  - Castle, Philip E.
AU  - Lorey, Thomas
AU  - Katki, Hormuzd A.
N1  - Publisher Copyright: Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.
PY  - 2017/9/30
Y1  - 2017/9/30
N2  - For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan–Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence–incidence models. Parameters for parametric prevalence–incidence models, such as the logistic regression and Weibull survival (logistic–Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan–Meier, logistic–Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan–Meier provided poor estimates, while the logistic–Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic–Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.
AB  - For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within large health-care providers who use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and there can be pre-existing (prevalent) disease. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. We consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in time of prevalent disease onset. We demonstrate that the naive Kaplan–Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. We propose a general family of mixture models for undiagnosed prevalent disease and interval-censored incident disease that we call prevalence–incidence models. Parameters for parametric prevalence–incidence models, such as the logistic regression and Weibull survival (logistic–Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. We compare naive Kaplan–Meier, logistic–Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan–Meier provided poor estimates, while the logistic–Weibull model was a close fit to the non-parametric estimates. Our findings support our use of logistic–Weibull models to develop the risk estimates that underlie current US risk-based cervical cancer screening guidelines. Published 2017. This article has been contributed to by US Government employees and their work is in the public domain in the USA.
KW  - HPV
KW  - Kaplan–Meier
KW  - cervical cancer
KW  - cumulative risk estimation
KW  - prevalence–incidence models
UR  - http://www.scopus.com/inward/record.url?scp=85021437043&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85021437043&partnerID=8YFLogxK
U2  - 10.1002/sim.7380
DO  - 10.1002/sim.7380
M3  - Article
C2  - 28660629
AN  - SCOPUS:85021437043
SN  - 0277-6715
VL  - 36
SP  - 3583
EP  - 3595
JO  - Statistics in Medicine
JF  - Statistics in Medicine
IS  - 22
ER  -
