Results 1 - 20 of 21
1.
Sensors (Basel) ; 24(15)2024 Jul 30.
Article in English | MEDLINE | ID: mdl-39123969

ABSTRACT

License plate (LP) information is an important part of personal privacy, which is protected by law. However, in some publicly available transportation datasets, the LP areas in the images have not been processed. Other datasets have applied simple de-identification operations such as blurring and masking, and such crude operations lead to a loss of data utility. In this paper, we propose a method of LP de-identification based on a generative adversarial network (LPDi GAN) to transform an original image into a synthetic one with a generated LP. To maintain the original LP attributes, background features are extracted from the surrounding image to generate LPs that are similar to the originals. The LP template and LP style are also fed into the network to obtain synthetic LPs with controllable characters and higher quality. The results show that LPDi GAN can perceive changes in environmental conditions and LP tilt angles, and control the LP characters through the LP templates. The perceptual similarity metric, Learned Perceptual Image Patch Similarity (LPIPS), reaches 0.25 while character recognition on the de-identified images remains effective, demonstrating that LPDi GAN can achieve outstanding de-identification while preserving strong data utility.
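
The abstract reports an LPIPS of 0.25 between original and de-identified plates. Below is a minimal sketch of how such a perceptual-similarity score can be computed with the third-party lpips package; the file paths, image size, and preprocessing are assumptions for illustration, not the authors' evaluation pipeline.

```python
# Sketch: LPIPS between an original and a de-identified license-plate image.
# Assumes the `lpips`, `torch`, `torchvision`, and `Pillow` packages are installed.
import lpips
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((128, 256)),                        # rough LP aspect ratio (assumed)
    transforms.ToTensor(),                                # values in [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # LPIPS expects [-1, 1]
])

def load(path: str) -> torch.Tensor:
    return to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, H, W)

metric = lpips.LPIPS(net="alex")  # AlexNet backbone, the common default
with torch.no_grad():
    score = metric(load("plate_original.png"), load("plate_deidentified.png"))
print(f"LPIPS = {score.item():.3f}")  # lower values mean higher perceptual similarity
```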

2.
BMC Med Inform Decis Mak ; 24(1): 147, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38816848

ABSTRACT

BACKGROUND: Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, could damage the dataset's utility. However, finding a reasonable balance between data privacy and utility is not straightforward, and few studies have investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and assess the feasibility of finding a workable tradeoff between data privacy and utility. METHODS: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two. RESULTS: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. CONCLUSIONS: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.


Subject(s)
Confidentiality , Data Anonymization , Humans , Confidentiality/standards , Emergency Service, Hospital , Length of Stay , Republic of Korea , Male
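
The study's core comparison is prediction performance on de-identified versus original data. The toy sketch below illustrates that kind of comparison (it does not use ARX, which is a Java tool): generalize a quasi-identifier, then compare logistic-regression AUC on the original and "de-identified" versions of a simulated dataset. All variable names, the outcome model, and the generalization rule are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1155  # same order of magnitude as the study's cohort
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "triage_level": rng.integers(1, 6, n),
})
# Hypothetical outcome: prolonged ED stay more likely for older, sicker patients.
logit = 0.03 * (df["age"] - 50) - 0.5 * (df["triage_level"] - 3)
df["long_stay"] = rng.random(n) < 1 / (1 + np.exp(-logit))

def auc(features: pd.DataFrame, y: pd.Series) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# "De-identification": generalize age into 10-year bands (a simple generalization hierarchy).
deid = df.copy()
deid["age"] = (deid["age"] // 10) * 10

print("AUC original     :", round(auc(df[["age", "triage_level"]], df["long_stay"]), 3))
print("AUC de-identified:", round(auc(deid[["age", "triage_level"]], deid["long_stay"]), 3))
```
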
3.
JMIR Form Res ; 8: e53241, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38648097

ABSTRACT

BACKGROUND: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. OBJECTIVE: This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. METHODS: We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. RESULTS: The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. CONCLUSIONS: We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.
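
As a rough illustration of the factorize-then-simulate idea described above, the sketch below fits a plain CP decomposition (via tensorly) to a toy patients x features x time tensor and samples new patient factors from a fitted multivariate normal. This is a simplification: the paper uses generalized CP (GCP) and samples latent factors with sequential trees, copulas, or Hamiltonian Monte Carlo, none of which are reproduced here, and the tensor is random toy data.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
X = rng.random((200, 10, 8))           # patients x clinical features x time points (toy)
rank = 5

weights, factors = parafac(tl.tensor(X), rank=rank, init="random", random_state=0)
patients, features, times = factors    # factor matrices: (200, 5), (10, 5), (8, 5)

# Simulate new "patients" by sampling latent factors from a multivariate normal fitted
# to the real patient factor matrix, breaking the one-to-one link to real records.
mu = patients.mean(axis=0)
cov = np.cov(patients, rowvar=False)
synthetic_patients = rng.multivariate_normal(mu, cov, size=300)   # count may differ

synthetic_tensor = tl.cp_to_tensor((weights, [synthetic_patients, features, times]))
print(synthetic_tensor.shape)          # (300, 10, 8) synthetic longitudinal tensor
```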

4.
Contemp Clin Trials ; 141: 107514, 2024 06.
Article in English | MEDLINE | ID: mdl-38537901

ABSTRACT

BACKGROUND: Better use of healthcare systems data, collected as part of interactions between patients and the healthcare system, could transform planning and conduct of randomised controlled trials. Multiple challenges to widespread use include whether healthcare systems data captures sufficiently well the data traditionally captured on case report forms. "Data Utility Comparison Studies" (DUCkS) assess the utility of healthcare systems data for RCTs by comparison to data collected by the trial. Despite their importance, there are few published UK examples of DUCkS. METHODS-AND-RESULTS: Building from ongoing and selected recent examples of UK-led DUCkS in the literature, we set out experience-based considerations for the conduct of future DUCkS. Developed through informal iterative discussions in many forums, considerations are offered for planning, protocol development, data, analysis and reporting, with comparisons at "patient-level" or "trial-level", depending on the item of interest and trial status. DISCUSSION: DUCkS could be a valuable tool in assessing where healthcare systems data can be used for trials and in which trial teams can play a leading role. There is a pressing need for trials to be more efficient in their delivery and research waste must be reduced. Trials have been making inconsistent use of healthcare systems data, not least because of an absence of evidence of utility. DUCkS can also help to identify challenges in using healthcare systems data, such as linkage (access and timing) and data quality. We encourage trial teams to incorporate and report DUCkS in trials and funders and data providers to support them.


Subject(s)
Randomized Controlled Trials as Topic , Humans , Randomized Controlled Trials as Topic/methods , Research Design , Delivery of Health Care/organization & administration , United Kingdom , Data Collection/methods
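
DUCkS compare healthcare-systems data against trial-collected data at the patient or trial level. A minimal sketch of one such patient-level comparison follows: merge the two sources on a trial identifier and compute agreement for a binary outcome (raw agreement and Cohen's kappa), plus a simple trial-level event-count check. The data frames, column names, and values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical extracts: outcome recorded on trial CRFs vs. in routine hospital data.
trial = pd.DataFrame({"trial_id": [1, 2, 3, 4, 5],
                      "readmitted_crf": [0, 1, 0, 0, 1]})
routine = pd.DataFrame({"trial_id": [1, 2, 3, 4, 5],
                        "readmitted_routine": [0, 1, 1, 0, 1]})

merged = trial.merge(routine, on="trial_id", how="inner")

# Patient-level agreement between the two sources.
kappa = cohen_kappa_score(merged["readmitted_crf"], merged["readmitted_routine"])
agreement = (merged["readmitted_crf"] == merged["readmitted_routine"]).mean()

# Trial-level comparison: would the headline event count differ?
print(f"linked records: {len(merged)}")
print(f"raw agreement : {agreement:.2f}, kappa: {kappa:.2f}")
print(f"events (CRF)  : {merged['readmitted_crf'].sum()}, "
      f"events (routine): {merged['readmitted_routine'].sum()}")
```
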
5.
Int J Popul Data Sci ; 8(1): 2158, 2023.
Article in English | MEDLINE | ID: mdl-38414544

ABSTRACT

Introduction: Federated Learning (FL) is a decentralised approach to training statistical models, where training is performed across multiple clients, producing one global model. Since the training data remains with each local client and is not shared or exchanged with other clients, the use of FL may reduce privacy and security risks (compared to methods where multiple data sources are pooled) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original but that does not contain any of the original data records, thereby minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format. Objectives: The objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps. Methods: A scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases and the findings are described, grouped, and summarised. Information extracted included article characteristics, documenting the type of data that is synthesised, the model architecture and the methods (if any) used to evaluate utility and privacy risk. Results: A total of 69 articles were included in the scoping review; all were published between 2018 and 2023, with two-thirds (46) in 2022. 30% (21) were focussed on synthetic data generation as the main model output (with 6 of these generating tabular data), whereas 59% (41) focussed on data augmentation. Of the 21 performing federated synthesis, all used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data. Conclusions: Federated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. As a field in its infancy, there are areas to explore in terms of the privacy risk associated with the various methods proposed, and more generally in how we measure those risks.


Subject(s)
Disclosure , Evidence Gaps , Humans , Databases, Factual , Interior Design and Furnishings , Medical Records Systems, Computerized
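
Federated synthesis trains generative models across clients and aggregates them into one global model without moving the local data. A bare-bones sketch of the aggregation step in the FedAvg style (size-weighted averaging of per-client generator parameters) is shown below; the parameter dictionaries and client sizes are toy values, and a real system would interleave this with local GAN training rounds.

```python
import numpy as np

def fed_avg(client_params: list[dict[str, np.ndarray]],
            client_sizes: list[int]) -> dict[str, np.ndarray]:
    """Size-weighted average of per-client generator parameters (FedAvg-style)."""
    total = sum(client_sizes)
    keys = client_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(client_sizes, client_params)) / total
            for k in keys}

# Toy example: three clients, each holding one locally trained generator layer.
rng = np.random.default_rng(0)
clients = [{"gen_layer1": rng.normal(size=(4, 4))} for _ in range(3)]
global_params = fed_avg(clients, client_sizes=[100, 250, 650])
print(global_params["gen_layer1"].shape)   # (4, 4) aggregated generator weights
```
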
6.
Crit Rev Food Sci Nutr ; : 1-17, 2022 Jul 26.
Article in English | MEDLINE | ID: mdl-35880485

ABSTRACT

In this age of data, digital tools are widely promoted as having tremendous potential for enhancing food safety. However, the potential of these digital tools depends on the availability and quality of data, and a number of obstacles need to be overcome to achieve the goal of digitally enabled "smarter food safety" approaches. One key obstacle is that participants in the food system and in food safety often lack the willingness to share data, due to fears of data abuse, bad publicity, liability, and the need to keep certain data (e.g., human illness data) confidential. As these multifaceted concerns lead to tension between data utility and privacy, the solutions to these challenges need to be multifaceted. This review outlines the data needs in digital food safety systems, exemplified in different data categories and model types, and key concerns associated with sharing of food safety data, including confidentiality and privacy of shared data. To address the data privacy issue a combination of innovative strategies to protect privacy as well as legal protection against data abuse need to be pursued. Existing solutions for maximizing data utility, while not compromising data privacy, are discussed, most notably differential privacy and federated learning.
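
The review points to differential privacy as one way to share sensitive food-safety information (for example, illness counts) without exposing individuals. A minimal sketch of the Laplace mechanism on a count query follows; epsilon, the facility names, and the counts are illustrative only.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Release a count with Laplace noise; the sensitivity of a count query is 1."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
illness_reports = {"facility_A": 3, "facility_B": 17, "facility_C": 0}
epsilon = 1.0   # smaller epsilon -> stronger privacy, noisier released counts
noisy = {k: laplace_count(v, epsilon, rng) for k, v in illness_reports.items()}
print(noisy)
```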

7.
Entropy (Basel) ; 24(5)2022 May 10.
Article in English | MEDLINE | ID: mdl-35626554

ABSTRACT

Preserving confidentiality of individuals in data disclosure is a prime concern for public and private organizations. The main challenge in the data disclosure problem is to release data such that misuse by intruders is avoided while providing useful information to legitimate users for analysis. We propose an information theoretic architecture for the data disclosure problem. The proposed framework consists of developing a maximum entropy (ME) model based on statistical information of the actual data, testing the adequacy of the ME model, producing disclosure data from the ME model and quantifying the discrepancy between the actual and the disclosure data. The architecture can be used both for univariate and multivariate data disclosure. We illustrate the implementation of our approach using financial data.
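
The framework fits a maximum-entropy model to statistics of the actual data, generates disclosure data from it, and quantifies the discrepancy. A compact sketch of the multivariate case is below: with mean and covariance as the retained statistics, the maximum-entropy distribution is Gaussian, so we fit it, sample a disclosure dataset, and measure the discrepancy with the closed-form Gaussian KL divergence. The toy data and the choice of KL as the discrepancy measure are assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.multivariate_normal([0.05, 0.02], [[1.0, 0.3], [0.3, 0.5]], size=2000)

# Maximum-entropy model given first and second moments = multivariate Gaussian.
mu, cov = actual.mean(axis=0), np.cov(actual, rowvar=False)
disclosure = rng.multivariate_normal(mu, cov, size=2000)   # data released in place of actual

def gaussian_kl(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) in closed form."""
    k = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

m_d, S_d = disclosure.mean(axis=0), np.cov(disclosure, rowvar=False)
print(f"KL(actual || disclosure) = {gaussian_kl(mu, cov, m_d, S_d):.4f}")
```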

8.
JMIR Med Inform ; 10(4): e35734, 2022 Apr 07.
Article in English | MEDLINE | ID: mdl-35389366

ABSTRACT

BACKGROUND: A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. OBJECTIVE: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. METHODS: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. RESULTS: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions. CONCLUSIONS: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
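
The winning utility metric is a multivariate Hellinger distance computed on a Gaussian-copula representation of the real and synthetic joint distributions. The sketch below follows that general recipe under simplifying assumptions and is not the authors' implementation: transform each column to normal scores (the copula step), estimate the two correlation matrices, and use the closed-form Hellinger distance between zero-mean Gaussians.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores(X: np.ndarray) -> np.ndarray:
    """Map each column to standard-normal scores via empirical ranks (the copula step)."""
    n = X.shape[0]
    return norm.ppf(np.apply_along_axis(rankdata, 0, X) / (n + 1))

def copula_hellinger(real: np.ndarray, synth: np.ndarray) -> float:
    S1 = np.corrcoef(normal_scores(real), rowvar=False)
    S2 = np.corrcoef(normal_scores(synth), rowvar=False)
    # Hellinger distance between N(0, S1) and N(0, S2) in closed form.
    num = np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
    den = np.linalg.det((S1 + S2) / 2) ** 0.5
    return float(np.sqrt(1.0 - num / den))   # 0 = identical dependence structure

rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0, 0], [[1, .6, .2], [.6, 1, .4], [.2, .4, 1]], 5000)
synth = rng.multivariate_normal([0, 0, 0], [[1, .5, .1], [.5, 1, .3], [.1, .3, 1]], 5000)
print(round(copula_hellinger(real, synth), 4))
```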

9.
J Am Med Inform Assoc ; 29(8): 1350-1365, 2022 07 12.
Article in English | MEDLINE | ID: mdl-35357487

ABSTRACT

OBJECTIVE: This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses on the population-level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression-an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.


Subject(s)
COVID-19 , SARS-CoV-2 , Cohort Studies , Humans , United States/epidemiology
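
One disclosure countermeasure described above is suppressing the zip-code label for areas with very few tests. A tiny pandas sketch of such a rule follows; the threshold, column names, and records are invented for illustration.

```python
import pandas as pd

tests = pd.DataFrame({
    "zip":    ["10001", "10001", "10002", "10003", "10003", "10003"],
    "result": ["neg",   "pos",   "neg",   "neg",   "pos",   "neg"],
})

SUPPRESSION_THRESHOLD = 3   # hypothetical cut-off for "few total tests"
counts = tests.groupby("zip")["result"].transform("size")
tests["zip_released"] = tests["zip"].where(counts >= SUPPRESSION_THRESHOLD,
                                           other="SUPPRESSED")
print(tests)
```
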
10.
Int J Popul Data Sci ; 7(1): 1727, 2022.
Article in English | MEDLINE | ID: mdl-37650026

ABSTRACT

Use of administrative data for research and for planning services has increased over recent decades due to the value of the large, rich information available. However, concerns about the release of sensitive or personal data and the associated disclosure risk can lead to lengthy approval processes and restricted data access. This can delay or prevent the production of timely evidence. A promising solution to facilitate more efficient data access is to create synthetic versions of the original datasets which are less likely to hold confidential information and can minimise disclosure risk. Such data may be used as an interim solution, allowing researchers to develop their analysis plans on non-disclosive data, whilst waiting for access to the real data. We aim to provide an overview of the background and uses of synthetic data and describe common methods used to generate synthetic data in the context of UK administrative research. We propose a simplified terminology for categories of synthetic data (univariate, multivariate, and complex modality synthetic data) as well as a more comprehensive description of the terminology used in the existing literature and illustrate challenges and future directions for research.


Subject(s)
Disclosure , Research Personnel , Humans
11.
J Big Data ; 8(1): 82, 2021.
Article in English | MEDLINE | ID: mdl-34777945

ABSTRACT

Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions of manufacturing costs, and significant demands for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous and multisource data; however, not all data are of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly more complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) and model complexity. DVM can be used to determine if appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choices of data analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity term captures the usefulness of the sample data specifically in the context of the inferential task. The regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of the information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model. We tested the DVM method for several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets with weak and strong signal information are used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the tradeoffs between algorithmic complexity and data analytical value in terms of the sample size and the feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data to optimize the relative utility of various supervised or unsupervised algorithms.
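
The abstract describes DVM as a regularized mixture of a fidelity term (task performance) and a complexity term, but does not give the exact formulation. The following is therefore a loose, hypothetical illustration of the idea only: cross-validated accuracy as fidelity minus a penalty that grows with the number of features, evaluated as the feature set expands, to show how richer data can stop paying off.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def dvm_score(X: np.ndarray, y: np.ndarray, lam: float = 0.005) -> float:
    """Hypothetical DVM-style score: fidelity (CV accuracy) minus a complexity penalty."""
    fidelity = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    return fidelity - lam * X.shape[1]            # penalty grows with feature richness

# shuffle=False keeps the informative features in the first columns.
X, y = make_classification(n_samples=600, n_features=40, n_informative=6,
                           shuffle=False, random_state=0)
for n_feat in (5, 10, 20, 40):                    # expand the feature set step by step
    print(n_feat, round(dvm_score(X[:, :n_feat], y), 3))
```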

12.
Res Social Adm Pharm ; 17(5): 930-941, 2021 05.
Article in English | MEDLINE | ID: mdl-32883619

ABSTRACT

We study the effects of differentially private (DP) noise injection techniques in a survey data setting, using the release of cost of early care and education estimates from the National Survey of Early Care and Education as a motivating example. As an example of how DP noise injection affects statistical estimates, our analysis compares the relative performance of DP techniques in the context of releasing estimates of means, medians, and regression coefficients. The results show that for many statistics, basic DP techniques show good performance provided that the privacy budget does not need to be split over too many estimates. Throughout, we show that small decisions, such as the number of bins in a histogram or the scaling of a variable in a regression equation, can have sometimes dramatic effects on the end results. Because of this, it is important to develop DP techniques with an eye towards the most important aspects of the data for end users.


Subject(s)
Privacy , Humans
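
A key point in the abstract is that performance degrades when the privacy budget must be split across many estimates. The small sketch below releases noisy means of a bounded survey variable under a total budget epsilon, giving each of k estimates epsilon/k, to show the noise scale growing with k. The variable, its bounds, and the budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
weekly_cost = rng.uniform(50, 400, size=1000)      # toy "cost of care" survey responses
lo, hi = 50.0, 400.0                               # assumed public bounds of the variable

def dp_mean(x: np.ndarray, eps: float) -> float:
    """Laplace mechanism for a bounded mean; sensitivity = (hi - lo) / n."""
    sensitivity = (hi - lo) / len(x)
    return float(np.clip(x, lo, hi).mean() + rng.laplace(scale=sensitivity / eps))

total_eps = 1.0
for k in (1, 5, 20):                               # number of estimates sharing the budget
    eps_each = total_eps / k
    print(f"k={k:2d}  eps/estimate={eps_each:.3f}  "
          f"noisy mean={dp_mean(weekly_cost, eps_each):.2f}  "
          f"true mean={weekly_cost.mean():.2f}")
```
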
13.
AMIA Jt Summits Transl Sci Proc ; 2020: 617-625, 2020.
Article in English | MEDLINE | ID: mdl-32477684

ABSTRACT

Artificial intelligence enabled medical big data analysis has the potential to revolutionize medical practice from diagnosis and prediction of complex diseases to making recommendations and resource allocation decisions in an evidence-based manner. However, big data comes with big disclosure risks. To preserve privacy, excessive data anonymization is often necessary, leading to significant loss of data utility. In this paper, we develop a systematic data scrubbing procedure for large datasets when key variables are uncertain for re-identification risk assessment and assess the trade-off between anonymization of electronic health record data for sharing in support of open science and performance of machine learning models for early acute kidney injury risk prediction using the data. Results demonstrate that our proposed data scrubbing procedure can maintain good feature diversity and moderate data utility but raises concerns regarding its impact on knowledge discovery capability.
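
A common way to gauge re-identification risk before scrubbing is to look at how many records are unique, or sit in very small groups, on the quasi-identifiers. A brief pandas sketch of that check follows; the columns, values, and group-size threshold are assumptions, not the paper's risk-assessment procedure.

```python
import pandas as pd

ehr = pd.DataFrame({
    "age":  [34, 34, 71, 71, 59, 59, 59, 23],
    "sex":  ["F", "F", "M", "M", "F", "F", "F", "M"],
    "zip3": ["021", "021", "100", "100", "606", "606", "606", "940"],
})

quasi_identifiers = ["age", "sex", "zip3"]
group_sizes = ehr.groupby(quasi_identifiers)["age"].transform("size")

k = 2   # hypothetical minimum acceptable group size (k-anonymity style)
print("records at risk (group size < k):", int((group_sizes < k).sum()))
print("simple average risk (mean 1/group size):", round((1.0 / group_sizes).mean(), 3))
```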

14.
JMIR Med Inform ; 8(7): e18910, 2020 Jul 20.
Article in English | MEDLINE | ID: mdl-32501278

ABSTRACT

BACKGROUND: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. OBJECTIVE: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. METHODS: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. RESULTS: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. CONCLUSIONS: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
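
The study's core design is "train on synthetic, test on real" (TSTR). A compact sketch of that protocol for one classifier is shown below; here the "synthetic" training set is just a noise-perturbed stand-in rather than the output of the CART, parametric, or Bayesian-network generators used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train_real, X_test_real, y_train_real, y_test_real = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Stand-in "synthetic" training data: a perturbed copy of the real training set.
X_train_syn = X_train_real + rng.normal(scale=0.3, size=X_train_real.shape)
y_train_syn = y_train_real

def tstr_accuracy(X_tr, y_tr) -> float:
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_test_real, model.predict(X_test_real))  # always test on real

print("trained on real     :", round(tstr_accuracy(X_train_real, y_train_real), 3))
print("trained on synthetic:", round(tstr_accuracy(X_train_syn, y_train_syn), 3))
```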

15.
Front Robot AI ; 7: 96, 2020.
Article in English | MEDLINE | ID: mdl-33501263

ABSTRACT

Pervasive sensing is increasing our ability to monitor the status of patients not only when they are hospitalized but also during home recovery. As a result, lots of data are collected and are available for multiple purposes. If operations can take advantage of timely and detailed data, the huge amount of data collected can also be useful for analytics. However, these data may be unusable for two reasons: data quality and performance problems. First, if the quality of the collected values is low, the processing activities could produce insignificant results. Second, if the system does not guarantee adequate performance, the results may not be delivered at the right time. The goal of this document is to propose a data utility model that considers the impact of the quality of the data sources (e.g., collected data, biographical data, and clinical history) on the expected results and allows for improvement of the performance through utility-driven data management in a Fog environment. Regarding data quality, our approach aims to consider it as a context-dependent problem: a given dataset can be considered useful for one application and inadequate for another application. For this reason, we suggest a context-dependent quality assessment considering dimensions such as accuracy, completeness, consistency, and timeliness, and we argue that different applications have different quality requirements to consider. The management of data in Fog computing also requires particular attention to quality of service requirements. For this reason, we include QoS aspects in the data utility model, such as availability, response time, and latency. Based on the proposed data utility model, we present an approach based on a goal model capable of identifying when one or more dimensions of quality of service or data quality are violated and of suggesting which is the best action to be taken to address this violation. The proposed approach is evaluated with a real and appropriately anonymized dataset, obtained as part of the experimental procedure of a research project in which a device with a set of sensors (inertial, temperature, humidity, and light sensors) is used to collect motion and environmental data associated with the daily physical activities of healthy young volunteers.
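
The proposed model scores a data source on data-quality dimensions (accuracy, completeness, consistency, timeliness) together with quality-of-service dimensions (availability, response time, latency), against application-specific requirements. The sketch below is a very rough, hypothetical simplification of that idea: a weighted utility score plus a check for violated per-application minimums; every number and weight is invented and this is not the paper's goal-model mechanism.

```python
# Hypothetical utility scoring over data-quality and QoS dimensions (all values invented).
measured = {"accuracy": 0.92, "completeness": 0.80, "consistency": 0.95,
            "timeliness": 0.70, "availability": 0.99, "response_time": 0.60}

# Application-specific weights and minimum requirements (context-dependent by design).
weights = {"accuracy": 0.3, "completeness": 0.2, "consistency": 0.1,
           "timeliness": 0.2, "availability": 0.1, "response_time": 0.1}
minimums = {"completeness": 0.85, "timeliness": 0.75}

utility = sum(weights[d] * measured[d] for d in weights)
violations = [d for d, m in minimums.items() if measured[d] < m]

print(f"data utility score: {utility:.2f}")
print("violated dimensions:", violations or "none")
```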

16.
BMC Med Res Methodol ; 19(1): 204, 2019 11 05.
Article in English | MEDLINE | ID: mdl-31690260

ABSTRACT

BACKGROUND: Clinical study reports (CSRs) have been increasingly utilised within academic research in recent years. European Medicines Agency (EMA) Policy 0070 'Phase 1,' which came into effect in January 2015, requires the publication of regulatory documents such as CSRs from central applications in an anonymised format. EMA Policy 0070 requires sponsors to demonstrate careful consideration of data utility within anonymised CSRs published within the scope of the policy, yet the concept of data utility is not clearly defined in the associated anonymisation guidance. OBJECTIVE: To review the use of data from CSRs in published academic research and to hypothesise the potential data utility of CSRs anonymised under the objectives of EMA Policy 0070 for future academic research. METHODS: Review of the objectives, research methodologies and findings of academic research reports using unpublished data from CSRs (prior to EMA Policy 0070). Semi-structured interviews with authors of academic research reports, including questions related to data utility of anonymised CSRs published under EMA Policy 0070. RESULTS: Thirteen academic research reports were identified and reviewed. The research purposes ranged from assessment of reporting bias, comparison of methods and results with published data sources, detailed evaluation of harms and adverse events, re-analysis and novel analyses including systematic reviews and meta-analysis. All of the examples identified required access to the methods and results sections of CSRs (including aggregated summary tables) and research purposes relating to evaluation of adverse events also required access to participant narratives. Retaining anonymised participant narratives relating to interventions, findings and events, while maintaining an acceptably low risk of participant re-identification, may provide an important gain in data utility and further understanding of drug safety profiles. CONCLUSIONS: This work provides an initial insight into the previous use of CSR data and current practices for including regulatory data in academic research. This work also provides early guidance to qualitatively assess and document data utility within anonymised CSRs published under EMA Policy 0070.


Subject(s)
Biomedical Research/standards , Clinical Trials as Topic/standards , Research Report/standards , Technology Assessment, Biomedical/methods , Biomedical Research/methods , Drug Industry/organization & administration , Drug Industry/standards , Europe , Humans , Periodicals as Topic/standards , Publication Bias , Technology Assessment, Biomedical/organization & administration
17.
Article in English | MEDLINE | ID: mdl-31731730

ABSTRACT

Patient data or information collected from public health and health care surveys are of great research value. Usually, the data contain sensitive personal information. Doctors, nurses, or researchers in the public health and health care sector do not analyze the available datasets or survey data on their own, and may outsource the tasks to third parties. Even though all identifiers such as names and ID card numbers are removed, there may still be some occasions in which an individual can be re-identified via the demographic or particular information provided in the datasets. Such data privacy issues can become an obstacle in health-related research. Statistical disclosure control (SDC) is a useful technique used to resolve this problem by masking and designing released data based on the original data. Whilst ensuring the released data can satisfy the needs of researchers for data analysis, there is high protection of the original data from disclosure. In this research, we discuss the statistical properties of two SDC methods: the General Additive Data Perturbation (GADP) method and the Gaussian Copula General Additive Data Perturbation (CGADP) method. An empirical study is provided to demonstrate how we can apply these two SDC methods in public health research.


Subject(s)
Confidentiality/standards , Data Interpretation, Statistical , Public Health , Research Design , Empirical Research , Humans
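
GADP and CGADP release perturbed versions of confidential values while preserving key statistical properties of the original data. Their exact estimators are not reproduced here; as a simplified illustration of the same goal, the sketch below applies an additive Gaussian perturbation calibrated so that the released data keep (in expectation) the original mean and covariance.

```python
import numpy as np

def additive_perturbation(X: np.ndarray, d: float,
                          rng: np.random.Generator) -> np.ndarray:
    """Release Y with the same mean and covariance (in expectation) as X.

    Y = mu + sqrt(1 - d) * (X - mu) + e, with e ~ N(0, d * Sigma).
    d in (0, 1) controls how much of each record is replaced by noise.
    This is a simplified moment-preserving scheme, not the exact GADP/CGADP estimator.
    """
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    noise = rng.multivariate_normal(np.zeros(X.shape[1]), d * sigma, size=X.shape[0])
    return mu + np.sqrt(1.0 - d) * (X - mu) + noise

rng = np.random.default_rng(3)
confidential = rng.multivariate_normal([120, 80], [[90, 30], [30, 40]], size=5000)
released = additive_perturbation(confidential, d=0.5, rng=rng)

print(np.round(confidential.mean(axis=0), 2), np.round(released.mean(axis=0), 2))
print(np.round(np.cov(confidential, rowvar=False), 1))
print(np.round(np.cov(released, rowvar=False), 1))
```
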
18.
J Gen Intern Med ; 34(3): 467-472, 2019 03.
Article in English | MEDLINE | ID: mdl-30511288

ABSTRACT

Emerging health care research paradigms such as comparative effectiveness research (CER), patient-centered outcome research (PCOR), and precision medicine (PM) share one ultimate goal: constructing evidence to provide the right treatment to the right patient at the right time. We argue that to succeed at this goal, it is crucial to have both timely access to individual-level data and fine geographic granularity in the data. Existing data will continue to be an important resource for observational studies as new data sources are developed. We examined widely used publicly funded health databases and population-based survey systems and found four ways they could be improved to better support the new research paradigms: (1) finer and more consistent geographic granularity, (2) more complete geographic coverage of the US population, (3) shorter time from data collection to data release, and (4) improved environments for restricted data access. We believe that existing data sources, if utilized optimally, and newly developed data infrastructures will both play a key role in expanding our insight into what treatments, at what time, work for each patient.


Subject(s)
Data Management/statistics & numerical data , Databases, Factual/statistics & numerical data , Patient Outcome Assessment , Public Health/statistics & numerical data , Comparative Effectiveness Research/economics , Comparative Effectiveness Research/statistics & numerical data , Data Management/economics , Databases, Factual/economics , Humans , Precision Medicine/economics , Precision Medicine/statistics & numerical data , Public Health/economics , Time Factors , United States/epidemiology
19.
Sensors (Basel) ; 18(7)2018 Jul 23.
Article in English | MEDLINE | ID: mdl-30041443

ABSTRACT

Having an incentive mechanism is crucial for the recruitment of mobile users to participate in a sensing task and to ensure that participants provide high-quality sensing data. In this paper, we investigate a staged incentive and punishment mechanism for mobile crowd sensing. We first divide the incentive process into two stages: the recruiting stage and the sensing stage. In the recruiting stage, we introduce the payment incentive coefficient and design a Stackelberg-based game method. The participants can be recruited via game interaction. In the sensing stage, we propose a sensing data utility algorithm in the interaction. After the sensing task, the winners can be selected based on data utility, which is affected by time-space correlation. In particular, the participants' reputation accumulation can be carried out based on data utility, and a punishment mechanism is presented to reduce the waste of payment costs caused by malicious participants. Finally, we conduct an extensive study of our solution based on realistic data. Extensive experiments show that compared to the existing positive auction incentive mechanism (PAIM) and reverse auction incentive mechanism (RAIM), our proposed staged incentive mechanism (SIM) can effectively extend the incentive behavior from the recruiting stage to the sensing stage. It not only provides a real-time incentive in both the recruiting and sensing stages but also improves the utility of the sensing data.
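
In the sensing stage described above, payment and reputation depend on the utility of the submitted data, with a punishment rule for malicious or useless contributions. The toy sketch below illustrates one such rule (utility threshold, reputation update, withheld payment); all parameters and utilities are hypothetical and this is not the paper's SIM algorithm.

```python
# Toy reputation update and punishment rule driven by per-task data utility.
# Thresholds, learning rate, payments, and utilities are hypothetical illustrations.
participants = {"u1": 0.50, "u2": 0.50, "u3": 0.50}     # initial reputations
data_utility = {"u1": 0.90, "u2": 0.35, "u3": 0.05}     # utility of submitted data
UTILITY_THRESHOLD = 0.3      # below this, the submission is treated as malicious/useless
ALPHA = 0.2                  # reputation learning rate
BASE_PAYMENT = 10.0

for uid, rep in participants.items():
    util = data_utility[uid]
    if util < UTILITY_THRESHOLD:
        participants[uid] = max(0.0, rep - ALPHA)        # punish: drop reputation
        payment = 0.0                                    # withhold payment entirely
    else:
        participants[uid] = min(1.0, (1 - ALPHA) * rep + ALPHA * util)
        payment = BASE_PAYMENT * util                    # pay in proportion to utility
    print(uid, "reputation:", round(participants[uid], 2), "payment:", payment)
```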

20.
J Biomed Inform ; 50: 107-21, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24768775

ABSTRACT

Cost-benefit analysis is a prerequisite for making good business decisions. In the business environment, companies intend to make profit from maximizing information utility of published data while having an obligation to protect individual privacy. In this paper, we quantify the trade-off between privacy and data utility in health data publishing in terms of monetary value. We propose an analytical cost model that can help health information custodians (HICs) make better decisions about sharing person-specific health data with other parties. We examine relevant cost factors associated with the value of anonymized data and the possible damage cost due to potential privacy breaches. Our model guides an HIC to find the optimal value of publishing health data and could be utilized for both perturbative and non-perturbative anonymization techniques. We show that our approach can identify the optimal value for different privacy models, including K-anonymity, LKC-privacy, and ∊-differential privacy, under various anonymization algorithms and privacy parameters through extensive experiments on real-life data.


Subject(s)
Cost-Benefit Analysis , Electronic Health Records , Privacy , Publishing
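
The model weighs the monetary value of published anonymized data against the expected damage from potential privacy breaches. The paper's cost factors are not reproduced here; the sketch below only illustrates the shape of such an optimization over an anonymity parameter k, with invented value and risk curves.

```python
import numpy as np

ks = np.arange(2, 51)                       # candidate k-anonymity levels

# Invented curves: data value falls as records are generalized more heavily, while
# breach probability (and hence expected damage) falls as k increases.
data_value = 100_000 * np.exp(-0.04 * (ks - 2))        # $ value of the published data
breach_probability = 0.5 / ks                          # chance of a successful attack
damage_cost = 250_000                                  # $ cost if a breach occurs

net_benefit = data_value - breach_probability * damage_cost
best = ks[np.argmax(net_benefit)]
print(f"optimal k = {best}, net benefit = ${net_benefit.max():,.0f}")
```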