Results 1 - 20 of 218
1.
Discov Oncol ; 15(1): 447, 2024 Sep 14.
Article in English | MEDLINE | ID: mdl-39277568

ABSTRACT

BACKGROUND: Early detection of T790M mutation in exon 20 of epidermal growth factor receptor (EGFR) in non-small cell lung cancer (NSCLC) patients with brain metastasis is crucial for optimizing treatment strategies. In this study, we developed radiomics models to distinguish NSCLC patients with T790M-positive mutations from those with T790M-negative mutations using multisequence MR images of brain metastasis despite an imbalanced dataset. Various resampling techniques and classifiers were employed to identify the most effective strategy. METHODS: Radiomic analyses were conducted on a dataset comprising 125 patients, consisting of 18 with EGFR T790M-positive mutations and 107 with T790M-negative mutations. Seventeen first- and second-order statistical features were selected from CET1WI, T2WI, T2FLAIR, and DWI images. Four classifiers (logistic regression, support vector machine, random forest [RF], and extreme gradient boosting [XGBoost]) were evaluated under 13 different resampling conditions. RESULTS: The area under the curve (AUC) value achieved was 0.89, using the SVM-SMOTE oversampling method in combination with the XGBoost classifier. This performance was measured against the AUC reported in the literature, serving as an upper-bound reference. Additionally, comparable results were observed with other oversampling methods paired with RF or XGBoost classifiers. CONCLUSIONS: Our study demonstrates that, even when dealing with an imbalanced EGFR T790M dataset, reasonable predictive outcomes can be achieved by employing an appropriate combination of resampling techniques and classifiers. This approach has significant potential for enhancing T790M mutation detection in NSCLC patients with brain metastasis.
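
As an illustration of the winning combination reported above (SVM-SMOTE oversampling feeding an XGBoost classifier), here is a minimal sketch assuming scikit-learn, imbalanced-learn, and xgboost; the synthetic data, class ratio, and hyperparameters are placeholders, not the study's radiomics features or settings.

```python
# Sketch: SVM-SMOTE oversampling combined with XGBoost on an imbalanced dataset.
# Data and settings are illustrative, not the study's radiomics features.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline          # applies resampling only inside training folds
from imblearn.over_sampling import SVMSMOTE
from xgboost import XGBClassifier

# Placeholder data with roughly the 18:107 positive:negative ratio described above.
X, y = make_classification(n_samples=125, n_features=17, weights=[0.86, 0.14], random_state=0)

model = Pipeline([
    ("smote", SVMSMOTE(random_state=0)),        # oversample the minority class per fold
    ("clf", XGBClassifier(n_estimators=200, max_depth=3)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(f"cross-validated AUC: {auc.mean():.3f} +/- {auc.std():.3f}")
```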

2.
Comput Biol Chem ; 113: 108203, 2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39244896

ABSTRACT

OBJECTIVE: The prediction of sepsis, especially early diagnosis, has received significant attention in biomedical research. In order to improve current medical scoring systems and overcome the limitations of class imbalance and sample size in local EHR (electronic health records) data, we propose a novel knowledge-transfer-based approach, which combines a medical scoring system and an ordinal logistic regression model. MATERIALS AND METHODS: Medical scoring systems (i.e., NEWS, SIRS, and QSOFA) are generally robust and useful for sepsis diagnosis. With local EHR, machine-learning-based methods have been widely used for building prediction models, but they are often impacted by class imbalance and sample size. Knowledge distillation and knowledge transfer have recently been proposed as a combined approach for improving prediction performance and model generalization. In this study, we developed a novel knowledge-transfer-based method for combining a medical scoring system (after a proposed score transformation) and an ordinal logistic regression model. We mathematically confirmed that it is equivalent to a specific form of weighted regression. Furthermore, we theoretically explored its effectiveness in the scenario of class imbalance. RESULTS: For the local dataset and the MIMIC-IV dataset, the VUS values (the volume under the multi-dimensional ROC surface, a generalization of AUC-ROC to ordinal categories) of the knowledge-transfer-based model (ORNEWS) based on the NEWS scoring system were 0.384 and 0.339, respectively, while those of the traditional ordinal regression (OR) model were 0.352 and 0.322, respectively. Consistent results were also observed for the knowledge-transfer-based models based on the SIRS and QSOFA scoring systems in the ordinal scenarios. Additionally, the predicted probabilities and the binary-classification ROC curves of the knowledge-transfer-based models indicated that this approach increased the predicted probabilities for the minority classes while reducing those for the majority classes, which improved the AUCs/VUSs on imbalanced data. DISCUSSION: Knowledge transfer, which combines a medical scoring system with a machine-learning-based model, improves prediction performance for early diagnosis of sepsis, especially in scenarios of class imbalance and limited sample size.
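
The weighted-regression form of the knowledge transfer can only be illustrated loosely here: the sketch below collapses the paper's ordinal setting to a binary one and pools pseudo-labels derived from a NEWS-like score with local EHR labels in a single weighted logistic regression (scikit-learn assumed); the score transformation, the transfer weight alpha, and the data are invented for illustration.

```python
# Sketch (binary simplification): pool pseudo-labels derived from a transformed clinical
# score with local EHR labels, then fit one weighted logistic regression.
# The score transformation, the alpha weight, and the data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))                       # local EHR features (placeholder)
y = (rng.random(n) < 0.05).astype(int)            # rare sepsis label (~5% positives)
news_score = np.clip((X[:, 0] * 2 + rng.normal(size=n) + 3).round(), 0, 12)

# "Score transformation": map the NEWS-like score to a pseudo-probability, then to pseudo-labels.
pseudo_prob = 1.0 / (1.0 + np.exp(-(news_score - 7.0)))   # assumed monotone transformation
y_pseudo = (pseudo_prob > 0.5).astype(int)

# Weighted-regression form of the transfer: real labels get weight 1, score-derived
# pseudo-labels get weight alpha (how much the scoring-system knowledge is trusted).
alpha = 0.3
X_aug = np.vstack([X, X])
y_aug = np.concatenate([y, y_pseudo])
w_aug = np.concatenate([np.ones(n), np.full(n, alpha)])

clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug, sample_weight=w_aug)
print("coefficients:", clf.coef_.round(3))
```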

3.
BioData Min ; 17(1): 29, 2024 Sep 04.
Article in English | MEDLINE | ID: mdl-39232851

ABSTRACT

OBJECTIVE: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios. METHODS: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness. RESULTS: The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes. CONCLUSIONS: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
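
A minimal sketch of the recommended remedy, assuming imbalanced-learn and scikit-learn: oversample a small, low-positive-rate training set with SMOTE or ADASYN and compare AUC and G-mean against no resampling. The synthetic data and thresholds are placeholders, not the reproductive-medicine cohort.

```python
# Sketch: compare SMOTE and ADASYN oversampling on a small dataset with a low positive rate,
# evaluating a logistic regression with AUC and G-mean. Data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.metrics import geometric_mean_score

# Small sample (n=800) with ~8% positives, i.e., below the thresholds discussed above.
X, y = make_classification(n_samples=800, n_features=10, weights=[0.92, 0.08], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=1)

for name, sampler in [("none", None), ("SMOTE", SMOTE(random_state=1)), ("ADASYN", ADASYN(random_state=1))]:
    X_fit, y_fit = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    gmean = geometric_mean_score(y_te, clf.predict(X_te))
    print(f"{name:>6}: AUC={auc:.3f}  G-mean={gmean:.3f}")
```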

4.
ACS Appl Mater Interfaces ; 16(33): 43734-43741, 2024 Aug 21.
Article in English | MEDLINE | ID: mdl-39121441

ABSTRACT

Applying machine-learning techniques to imbalanced data sets presents a significant challenge in materials science, since the underrepresented characteristics of minority classes are often buried by the abundance of unrelated characteristics in the majority classes. Existing approaches address this by balancing the class counts using oversampling or synthetic data generation techniques. However, these methods can lead to loss of valuable information or overfitting. Here, we introduce a deep learning framework to predict minority-class materials, specifically within the realm of metal-insulator transition (MIT) materials. The proposed approach, termed boosting-CGCNN, combines the crystal graph convolutional neural network (CGCNN) model with a gradient-boosting algorithm. The model effectively handled extreme class imbalance in MIT material data by sequentially building a deeper neural network. Comparative evaluations demonstrated the superior performance of the proposed model relative to other approaches. Our approach is a promising solution for handling imbalanced data sets in materials science.
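
The authors' boosting-CGCNN operates on crystal graphs; as a generic, hedged illustration of the underlying idea only (sequentially fitting base learners to the gradient of the classification loss, here with small multilayer perceptrons on tabular placeholder data), a sketch follows. It is not the CGCNN architecture or the MIT dataset.

```python
# Sketch: generic gradient boosting with small neural-network base learners for an
# imbalanced binary task (a stand-in for the graph-based boosting-CGCNN idea).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initialize the ensemble at the log-odds of the (imbalanced) base rate.
p0 = y.mean()
F = np.full(len(y), np.log(p0 / (1 - p0)))
learners, lr = [], 0.3

for m in range(5):                                  # a few boosting rounds
    residual = y - sigmoid(F)                       # negative gradient of the log-loss
    h = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=m).fit(X, residual)
    F += lr * h.predict(X)
    learners.append(h)

print("training AUC:", round(roc_auc_score(y, sigmoid(F)), 3))
```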

5.
PeerJ Comput Sci ; 10: e2188, 2024.
Article in English | MEDLINE | ID: mdl-39145237

ABSTRACT

The enhancement of fabric quality prediction in the textile manufacturing sector is achieved by utilizing information derived from Internet of Things (IoT) sensors and Enterprise Resource Planning (ERP) systems linked to sensors embedded in textile machinery. The integration of Industry 4.0 concepts is instrumental in harnessing IoT sensor data, which, in turn, leads to improved productivity and reduced lead times in textile manufacturing processes. This study addresses the issue of imbalanced data pertaining to fabric quality within the textile manufacturing industry. It encompasses an evaluation of seven open-source automated machine learning (AutoML) technologies, namely FLAML (Fast Lightweight AutoML), AutoViML (Automatically Build Variant Interpretable ML models), EvalML (Evaluation Machine Learning), AutoGluon, H2OAutoML, PyCaret, and TPOT (Tree-based Pipeline Optimization Tool). The most suitable solutions are chosen for given circumstances by employing an innovative approach that finds a compromise between computational efficiency and forecast accuracy. The results reveal that EvalML emerges as the top-performing AutoML model for a predetermined objective function, particularly excelling in terms of mean absolute error (MAE). On the other hand, despite longer inference times, AutoGluon performs better than the other methods on measures such as mean absolute percentage error (MAPE), root mean squared error (RMSE), and R-squared. Additionally, the study explores the feature importance rankings provided by each AutoML model, shedding light on the attributes that significantly influence predictive outcomes. Notably, sin/cos encoding is found to be particularly effective in characterizing categorical variables with a large number of unique values. This study provides useful information about the application of AutoML in the textile industry and a roadmap for employing Industry 4.0 technologies to enhance fabric quality prediction. The research highlights the importance of striking a balance between predictive accuracy and computational efficiency, emphasizes the significance of feature importance for model interpretability, and lays the groundwork for future investigations in this field.
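
The sin/cos encoding mentioned above can be sketched in a few lines: each category is mapped to an integer code and projected onto the unit circle, so two dense columns replace a wide one-hot expansion. The column name and values below are invented placeholders.

```python
# Sketch: sin/cos encoding of a high-cardinality categorical column (e.g., a machine or loom ID).
# The column name and values are illustrative placeholders.
import numpy as np
import pandas as pd

df = pd.DataFrame({"loom_id": [f"L{i:03d}" for i in np.random.default_rng(0).integers(0, 250, size=1000)]})

codes, uniques = pd.factorize(df["loom_id"])        # map each category to an integer code
n = len(uniques)
angle = 2.0 * np.pi * codes / n                     # spread the codes around the unit circle
df["loom_sin"] = np.sin(angle)
df["loom_cos"] = np.cos(angle)

print(df.head())                                    # two dense columns replace ~250 one-hot columns
```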

6.
Diagnostics (Basel) ; 14(16)2024 Aug 08.
Article in English | MEDLINE | ID: mdl-39202215

ABSTRACT

INTRODUCTION: Convolutional neural network (CNN) systems in healthcare are influenced by unbalanced datasets and varying dataset sizes. This article delves into the impact of dataset size, class imbalance, and their interplay on CNN systems, focusing on the size of the training set versus imbalance, a perspective that is rare in the prevailing literature. Furthermore, it addresses scenarios with more than two classification groups, which are often overlooked but prevalent in practical settings. METHODS: Initially, a CNN was developed to classify lung diseases using X-ray images, distinguishing between healthy individuals and COVID-19 patients. Later, the model was expanded to include pneumonia patients. To evaluate performance, numerous experiments were conducted with varied dataset sizes and imbalance ratios for both binary and ternary classifications, measuring various indices to validate the model's efficacy. RESULTS: The study revealed that increasing dataset size positively impacts CNN performance, but this improvement saturates beyond a certain size. A novel finding is that the data balance ratio influences performance more significantly than dataset size. The behavior of three-class classification mirrored that of binary classification, underscoring the importance of balanced datasets for accurate classification. CONCLUSIONS: This study emphasizes that achieving balanced representation in datasets is crucial for optimal CNN performance in healthcare, challenging the conventional focus on dataset size. Balanced datasets improve classification accuracy in both two-class and three-class scenarios, highlighting the need for data-balancing techniques to improve model reliability and effectiveness. MOTIVATION: Our study is motivated by a scenario with 100 patient samples, offering two options: a balanced dataset with 200 samples or an unbalanced dataset with 500 samples (400 healthy individuals). We aim to provide insights into the optimal choice based on the interplay between dataset size and imbalance, enriching the discourse for stakeholders interested in achieving optimal model performance. LIMITATIONS: Recognizing a single model's limited generalizability, we assert that further studies on diverse datasets are needed.
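
A minimal sketch of the experimental grid described above: drawing training subsets with a chosen total size and positive-class fraction from a labeled pool. The label pool and grid values are placeholders, not the X-ray data.

```python
# Sketch: draw training subsets of a given total size and class-imbalance ratio from a
# labeled pool, the kind of grid used to study size vs. imbalance. Names are placeholders.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)            # 0 = healthy, 1 = COVID-19 (placeholder pool)

def subset_indices(labels, total, pos_fraction, rng):
    """Return indices for a subset with `total` samples and `pos_fraction` positives."""
    n_pos = int(round(total * pos_fraction))
    n_neg = total - n_pos
    pos_idx = rng.choice(np.flatnonzero(labels == 1), size=n_pos, replace=False)
    neg_idx = rng.choice(np.flatnonzero(labels == 0), size=n_neg, replace=False)
    return rng.permutation(np.concatenate([pos_idx, neg_idx]))

for total in (500, 1000, 2000):
    for pos_fraction in (0.5, 0.2, 0.1):             # balanced down to 10% positives
        idx = subset_indices(labels, total, pos_fraction, rng)
        print(total, pos_fraction, "->", idx.size, "samples,", labels[idx].mean().round(2), "positive rate")
```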

7.
Stud Health Technol Inform ; 316: 929-933, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176944

ABSTRACT

Predictive modeling holds a large potential in clinical decision-making, yet its effectiveness can be hindered by inherent data imbalances in clinical datasets. This study investigates the utility of synthetic data for improving the performance of predictive modeling on realistic small imbalanced clinical datasets. We compared various synthetic data generation methods including Generative Adversarial Networks, Normalizing Flows, and Variational Autoencoders to the standard baselines for correcting for class underrepresentation on four clinical datasets. Although results show improvement in F1 scores in some cases, even over multiple repetitions, we do not obtain statistically significant evidence that synthetic data generation outperforms standard baselines for correcting for class imbalance. This study challenges common beliefs about the efficacy of synthetic data for data augmentation and highlights the importance of evaluating new complex methods against simple baselines.
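
A minimal sketch of the evaluation protocol (repeat training with a baseline and an augmentation method, then run a paired significance test on the F1 scores), assuming scikit-learn, imbalanced-learn, and SciPy; SMOTE stands in for the generative models, which are beyond a short example.

```python
# Sketch: repeated evaluation of an augmentation method against a simple baseline,
# with a paired significance test on the F1 scores. SMOTE stands in for the
# generative models (GANs, VAEs, flows) compared in the study.
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, RandomOverSampler

f1_base, f1_aug = [], []
for seed in range(10):                               # repeated runs with different splits
    X, y = make_classification(n_samples=1500, n_features=15, weights=[0.9, 0.1], random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=seed)
    for scores, sampler in ((f1_base, RandomOverSampler(random_state=seed)),
                            (f1_aug, SMOTE(random_state=seed))):
        X_rs, y_rs = sampler.fit_resample(X_tr, y_tr)
        clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X_rs, y_rs)
        scores.append(f1_score(y_te, clf.predict(X_te)))

stat, p = wilcoxon(f1_aug, f1_base)                  # paired test over the repetitions
print(f"mean F1 baseline={sum(f1_base)/10:.3f}, augmented={sum(f1_aug)/10:.3f}, p={p:.3f}")
```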


Subject(s)
Clinical Decision-Making , Humans
8.
J Med Internet Res ; 26: e48595, 2024 Jul 30.
Article in English | MEDLINE | ID: mdl-39079116

ABSTRACT

BACKGROUND: Under- or late identification of pulmonary embolism (PE), a thrombosis of 1 or more pulmonary arteries that seriously threatens patients' lives, is a major challenge confronting modern medicine. OBJECTIVE: We aimed to establish accurate and informative machine learning (ML) models to identify patients at high risk for PE as they are admitted to the hospital, before their initial clinical checkup, by using only the information in their medical records. METHODS: We collected demographics, comorbidities, and medications data for 2568 patients with PE and 52,598 control patients. We focused on data available prior to emergency department admission, as these are the most universally accessible data. We trained an ML random forest algorithm to detect PE at the earliest possible time during a patient's hospitalization-at the time of his or her admission. We developed and applied 2 ML-based methods specifically to address the data imbalance between PE and non-PE patients, which causes misdiagnosis of PE. RESULTS: The resulting models predicted PE based on age, sex, BMI, past clinical PE events, chronic lung disease, past thrombotic events, and usage of anticoagulants, obtaining an 80% geometric mean value for the PE and non-PE classification accuracies. Although on hospital admission only 4% (1942/46,639) of the patients had a diagnosis of PE, we identified 2 clustering schemes comprising subgroups with more than 61% (705/1120 in clustering scheme 1; 427/701 and 340/549 in clustering scheme 2) positive patients for PE. One subgroup in the first clustering scheme included 36% (705/1942) of all patients with PE who were characterized by a definite past PE diagnosis, a 6-fold higher prevalence of deep vein thrombosis, and a 3-fold higher prevalence of pneumonia, compared with patients of the other subgroups in this scheme. In the second clustering scheme, 2 subgroups (1 of only men and 1 of only women) included patients who all had a past PE diagnosis and a relatively high prevalence of pneumonia, and a third subgroup included only those patients with a past diagnosis of pneumonia. CONCLUSIONS: This study established an ML tool for early diagnosis of PE almost immediately upon hospital admission. Despite the highly imbalanced scenario undermining accurate PE prediction and using information available only from the patient's medical history, our models were both accurate and informative, enabling the identification of patients already at high risk for PE upon hospital admission, even before the initial clinical checkup was performed. The fact that we did not restrict our patients to those at high risk for PE according to previously published scales (eg, Wells or revised Genova scores) enabled us to accurately assess the application of ML on raw medical data and identify new, previously unidentified risk factors for PE, such as previous pulmonary disease, in general populations.


Subject(s)
Machine Learning , Pulmonary Embolism , Humans , Pulmonary Embolism/diagnosis , Male , Risk Factors , Female , Middle Aged , Aged , Early Diagnosis , Hospitalization/statistics & numerical data , Adult , Patient Admission/statistics & numerical data
9.
Article in English | MEDLINE | ID: mdl-39069119

ABSTRACT

OBJECTIVE: The study objective was to develop comprehensive quality assurance models for procedural outcomes after adult cardiac surgery. METHODS: Based on 52,792 cardiac operations in adults performed in 19 hospitals of 3 high-performing hospital systems, models were developed for operative mortality (n = 1271), stroke (n = 895), deep sternal wound infection (n = 122), prolonged intubation (n = 6182), renal failure (n = 1265), prolonged postoperative stay (n = 5418), and reoperations (n = 1693). Random forest quantile classification, a method tailored for challenges of rare events, and model-free variable priority screening were used to identify predictors of events. RESULTS: A small set of preoperative variables was sufficient to model procedural outcomes for virtually all cardiac operations, including older age; advanced symptoms; left ventricular, pulmonary, renal, and hepatic dysfunction; lower albumin; higher acuity; and greater complexity of the planned operation. Geometric mean performance ranged from 0.63 to 0.76. Calibration covered large areas of probability. Continuous risk factors provided high information content, and their association with outcomes was visualized with partial plots. These risk factors differed in strength and configuration among hospitals, as did their risk-adjusted outcomes according to patient risk as determined by counterfactual causal inference within a framework of virtual (digital) twins. CONCLUSIONS: By using a small set of variables and contemporary machine-learning methods, comprehensive models for procedural operative mortality and major morbidity after adult cardiac surgery were developed based on data from 3 exemplary hospital systems. They provide surgeons, their patients, and hospital systems with 21st century tools for assessing their risks compared with these advanced hospital systems and improving cardiac surgery quality.

10.
Patterns (N Y) ; 5(6): 100994, 2024 Jun 14.
Article in English | MEDLINE | ID: mdl-39005487

ABSTRACT

Many problems in biology require looking for a "needle in a haystack," corresponding to a binary classification where there are a few positives within a much larger set of negatives, which is referred to as a class imbalance. The receiver operating characteristic (ROC) curve and the associated area under the curve (AUC) have been reported as ill-suited to evaluate prediction performance on imbalanced problems where there is more interest in performance on the positive minority class, while the precision-recall (PR) curve is preferable. We show via simulation and a real case study that this is a misinterpretation of the difference between the ROC and PR spaces, showing that the ROC curve is robust to class imbalance, while the PR curve is highly sensitive to class imbalance. Furthermore, we show that class imbalance cannot be easily disentangled from classifier performance measured via PR-AUC.
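
The contrast between the two spaces can be reproduced in a few lines: keep the class-conditional score distributions fixed and vary only the class ratio; ROC-AUC stays roughly constant while PR-AUC (average precision) falls as positives become rarer. The simulation below is illustrative, not the paper's case study.

```python
# Sketch: with fixed score distributions per class, ROC-AUC is insensitive to the class
# ratio while PR-AUC (average precision) falls as the positive class becomes rarer.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000

for pos_rate in (0.5, 0.1, 0.01, 0.001):
    n_pos = int(n * pos_rate)
    y = np.concatenate([np.ones(n_pos), np.zeros(n - n_pos)])
    # Same class-conditional score distributions at every imbalance level.
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n - n_pos)])
    print(f"positive rate {pos_rate:>6}: ROC-AUC={roc_auc_score(y, scores):.3f}  "
          f"PR-AUC={average_precision_score(y, scores):.3f}")
```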

11.
Eur J Med Res ; 29(1): 383, 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39054495

ABSTRACT

BACKGROUND: Tuberculosis spondylitis (TS), commonly known as Pott's disease, is a severe type of skeletal tuberculosis that typically requires surgical treatment. However, this treatment option has led to an increase in healthcare costs due to prolonged hospital stays. Therefore, identifying risk factors associated with an extended postoperative length of stay (PLOS) is necessary. In this research, we aimed to develop an interpretable machine learning model that could predict extended PLOS and provide valuable insights for treatment, and a web-based application was implemented. METHODS: We obtained patient data from the spine surgery department at our hospital. Extended postoperative length of stay (PLOS) refers to a hospitalization duration equal to or exceeding the 75th percentile following spine surgery. To identify relevant variables, we employed several approaches, including the least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE) based on support vector machine classification (SVC), correlation analysis, and permutation importance. Several models were implemented, and some of them were ensembled using soft-voting techniques. Models were constructed using grid search with nested cross-validation. The performance of each algorithm was assessed through various metrics, including the AUC value (area under the receiver operating characteristic curve) and the Brier score. Model interpretation involved methods such as Shapley additive explanations (SHAP), the Gini impurity index, permutation importance, and local interpretable model-agnostic explanations (LIME). Furthermore, to facilitate the practical application of the model, a web-based interface was developed and deployed. RESULTS: The study included a cohort of 580 patients, and 11 features (CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, and postoperative drainage) were selected. Most of the classifiers performed well; the XGBoost model achieved the highest AUC (0.86) and the lowest Brier score (0.126) and was chosen as the optimal model. The calibration and decision curve analysis (DCA) plots demonstrate that XGBoost achieved promising performance. After tenfold cross-validation, the XGBoost model demonstrated a mean AUC of 0.85 ± 0.09. SHAP and LIME were used to display the variables' contributions to the predicted value. The stacked bar plots indicated that infusion volume was the primary contributor, as determined by Gini impurity, permutation feature importance (PFI), and the LIME algorithm. CONCLUSIONS: Our methods not only effectively predicted extended PLOS but also identified risk factors that can be utilized in future treatment. The XGBoost model developed in this study is easily accessible through the deployed web application and can aid in clinical research.
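
A minimal sketch of the interpretation step, assuming the xgboost and shap packages: fit an XGBoost classifier and rank features by mean absolute SHAP value. The data and feature indices are placeholders, not the spine-surgery cohort.

```python
# Sketch: train an XGBoost classifier and inspect per-feature SHAP contributions.
# Data and feature names are placeholders, not the study's cohort.
import numpy as np
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# ~25% positives, mirroring an "extended PLOS = above the 75th percentile" label.
X, y = make_classification(n_samples=580, n_features=11, weights=[0.75, 0.25], random_state=0)
model = XGBClassifier(n_estimators=300, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)                # fast SHAP values for tree ensembles
shap_values = explainer.shap_values(X)               # one row of contributions per sample

mean_abs = np.abs(shap_values).mean(axis=0)          # global importance ranking
for i in np.argsort(mean_abs)[::-1][:5]:
    print(f"feature_{i}: mean |SHAP| = {mean_abs[i]:.3f}")
# shap.summary_plot(shap_values, X) would draw the usual beeswarm summary.
```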


Subject(s)
Length of Stay , Machine Learning , Tuberculosis, Spinal , Humans , Male , Female , Tuberculosis, Spinal/surgery , Middle Aged , Artificial Intelligence , Adult , Spondylitis/surgery , Spondylitis/microbiology , Algorithms
12.
Int J Mol Sci ; 25(12)2024 Jun 14.
Article in English | MEDLINE | ID: mdl-38928278

ABSTRACT

G-protein coupled receptors (GPCRs) are transmembrane proteins that transmit signals from the extracellular environment to the inside of the cells. Their ability to adopt various conformational states, which influence their function, makes them crucial in pharmacoproteomic studies. While many drugs target specific GPCR states to exert their effects (thereby regulating the protein's activity), unraveling the activation pathway remains challenging due to the multitude of intermediate transformations occurring throughout this process, which intrinsically influence the dynamics of the receptors. In this context, computational modeling, particularly molecular dynamics (MD) simulations, may offer valuable insights into the dynamics and energetics of GPCR transformations, especially when combined with machine learning (ML) methods and techniques for achieving model interpretability for knowledge generation. The current study builds upon previous work in which the layer-wise relevance propagation (LRP) technique was employed to interpret the predictions in a multi-class classification problem concerning the conformational states of the ß2-adrenergic (ß2AR) receptor from MD simulations. Here, we address the challenges posed by class imbalance and extend the previous analyses by evaluating the robustness and stability of deep learning (DL)-based predictions under different imbalance mitigation techniques. By meticulously evaluating explainability and imbalance strategies, we aim to produce reliable and robust insights.


Subject(s)
Deep Learning , Molecular Dynamics Simulation , Protein Conformation , Receptors, Adrenergic, beta-2 , Receptors, G-Protein-Coupled , Receptors, Adrenergic, beta-2/chemistry , Receptors, Adrenergic, beta-2/metabolism , Receptors, G-Protein-Coupled/chemistry , Receptors, G-Protein-Coupled/metabolism , Humans
13.
J Biopharm Stat ; : 1-14, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38860696

ABSTRACT

Accurate prediction of a rare and clinically important event following study treatment has been crucial in drug development. For instance, the rarity of an adverse event is often commensurate with the seriousness of its medical consequences, and delayed detection of a rare adverse event can pose significant or even life-threatening health risks to patients. In this machine learning case study, we demonstrate, with an example originating from a real clinical trial setting, how to define and solve a rare clinical event prediction problem using machine learning in the pharmaceutical industry. The unique contributions of this work include a proposed six-step investigation framework that facilitates communication with non-technical stakeholders and the interpretation of model performance in terms of practical consequences in the context of patient screening for a future clinical trial. In terms of machine learning methodology, for splitting the data into training and test sets, we adapt the rare-event stratified split approach (from scikit-learn) to further account for grouping, so that multiple records of the same patient are kept together. To handle the imbalanced data caused by rare events, cost-sensitive learning is employed in model training to give more weight to the minority class, and the metrics precision and recall are used to capture prediction performance instead of the raw accuracy rate. Finally, we demonstrate how to apply state-of-the-art SHAP values to identify important risk factors and improve model interpretability.
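
A brief sketch of the two methodological points above, assuming scikit-learn: a stratified split that keeps all records of a patient together (StratifiedGroupKFold), and cost-sensitive training with precision and recall reported instead of raw accuracy. Data, groups, and class weights are placeholders.

```python
# Sketch: rare-event stratified splitting by patient group plus cost-sensitive training.
# Data, patient groups, and class weights are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedGroupKFold

X, y = make_classification(n_samples=3000, n_features=12, weights=[0.97, 0.03], random_state=0)
groups = np.repeat(np.arange(1000), 3)               # 3 records per patient, kept together

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
prec, rec = [], []
for tr, te in cv.split(X, y, groups):
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")  # up-weight the rare class
    clf.fit(X[tr], y[tr])
    pred = clf.predict(X[te])
    prec.append(precision_score(y[te], pred, zero_division=0))
    rec.append(recall_score(y[te], pred, zero_division=0))

print(f"precision={np.mean(prec):.3f}  recall={np.mean(rec):.3f}")
```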

14.
Accid Anal Prev ; 203: 107614, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38781631

ABSTRACT

Vulnerable Road Users (VRUs), such as pedestrians and bicyclists, are at a higher risk of being involved in crashes with motor vehicles, and crashes involving VRUs also are more likely to result in severe injuries or fatalities. Signalized intersections are a major safety concern for VRUs due to their complex dynamics, emphasizing the need to understand how these road users interact with motor vehicles and deploy evidence-based safety countermeasures. Given the infrequency of VRU-related crashes, identifying conflicts between VRUs and motorized vehicles as surrogate safety indicators offers an alternative approach. Automatically detecting these conflicts using a video-based system is a crucial step in developing smart infrastructure to enhance VRU safety. However, further research is required to enhance its reliability and accuracy. Building upon a study conducted by the Pennsylvania Department of Transportation (PennDOT), which utilized a video-based event monitoring system to assess VRU and motor vehicle interactions at fifteen signalized intersections in Pennsylvania, this research aims to evaluate the reliability of automatically generated surrogates in predicting confirmed conflicts without human supervision, employing advanced data-driven models such as logistic regression and tree-based algorithms. The surrogate data used for this analysis include automatically collectable variables, such as vehicular and VRU speeds, movements, and post-encroachment time, in addition to manually collected variables such as signal states, lighting, and weather conditions. To address data scarcity challenges, synthetic data augmentation techniques are used to balance the dataset and enhance model robustness. The findings highlight the varying importance and impact of specific surrogates in predicting true conflicts, with some surrogates proving more informative than others. Additionally, the research examines the distinctions between significant variables in identifying bicycle and pedestrian conflicts. These findings can assist transportation agencies in collecting the right types of data to help prioritize infrastructure investments, such as bike lanes and crosswalks, and evaluate their effectiveness.


Subject(s)
Accidents, Traffic , Bicycling , Pedestrians , Video Recording , Humans , Bicycling/injuries , Accidents, Traffic/prevention & control , Accidents, Traffic/statistics & numerical data , Reproducibility of Results , Walking/injuries , Pennsylvania , Environmental Planning , Safety , Motor Vehicles
15.
Mach Learn ; 113(5): 2655-2674, 2024.
Article in English | MEDLINE | ID: mdl-38708086

ABSTRACT

With the rapid growth of memory and computing power, datasets are becoming increasingly complex and imbalanced. This is especially severe in the context of clinical data, where there may be one rare event for many cases in the majority class. We introduce an imbalanced classification framework, based on reinforcement learning, for training extremely imbalanced data sets, and extend it for use in multi-class settings. We combine dueling and double deep Q-learning architectures, and formulate a custom reward function and episode-training procedure, specifically with the capability of handling multi-class imbalanced training. Using real-world clinical case studies, we demonstrate that our proposed framework outperforms current state-of-the-art imbalanced learning methods, achieving more fair and balanced classification, while also significantly improving the prediction of minority classes. Supplementary Information: The online version contains supplementary material available at 10.1007/s10994-023-06481-z.
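
The paper's exact reward function is not reproduced here; the snippet below is an illustrative reward of the kind commonly used in RL-based imbalanced classification, where rewards are scaled by each class's frequency so that correct minority predictions are worth more.

```python
# Sketch: an illustrative reward function for RL-based imbalanced classification; the
# per-class reward is scaled by the class frequency so correct minority predictions are
# worth more. Not the paper's exact formulation.
from collections import Counter

def make_reward_fn(train_labels):
    counts = Counter(train_labels)
    n_min = min(counts.values())
    # Per-class reward scale: 1.0 for the rarest class, smaller for more frequent classes.
    scale = {cls: n_min / n for cls, n in counts.items()}
    def reward(true_label, predicted_label):
        r = scale[true_label]
        return r if predicted_label == true_label else -r
    return reward

reward = make_reward_fn([0] * 950 + [1] * 40 + [2] * 10)   # three classes, heavily imbalanced
print(reward(2, 2), reward(0, 0), reward(1, 0))            # 1.0, ~0.011, -0.25
```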

16.
Front Physiol ; 15: 1362185, 2024.
Article in English | MEDLINE | ID: mdl-38655032

ABSTRACT

Introduction: Atrial fibrillation (AF) is the most common cardiac arrhythmia, which is clinically identified by an irregular and rapid heartbeat rhythm. AF puts a patient at risk of forming blood clots, which can eventually lead to heart failure, stroke, or even sudden death. Electrocardiography (ECG), which involves acquiring bioelectrical signals from the body surface to reflect heart activity, is a standard procedure for detecting AF. However, the occurrence of AF is often intermittent, costing a significant amount of time and effort from medical doctors to identify AF episodes. Moreover, human error is inevitable, as even experienced medical professionals can overlook or misinterpret subtle signs of AF. As such, it is of critical importance to develop an advanced analytical model that can automatically interpret ECG signals and provide decision support for AF diagnostics. Methods: In this paper, we propose an innovative deep-learning method for automated AF identification using single-lead ECGs. We first extract time-frequency features from ECG signals using the continuous wavelet transform (CWT). Second, convolutional neural networks enhanced with residual learning (ResNet) are employed as the function approximator to interpret the time-frequency features extracted by the CWT. Third, we propose to incorporate a multi-branching structure into the ResNet to address the issue of class imbalance, where normal ECGs significantly outnumber instances of AF in ECG datasets. Results and Discussion: We evaluate the proposed multi-branching ResNet with CWT (CWT-MB-ResNet) on two ECG datasets, i.e., the PhysioNet/CinC Challenge 2017 dataset and ECGs obtained from the University of Oklahoma Health Sciences Center (OUHSC). The proposed CWT-MB-ResNet demonstrates robust prediction performance, achieving an F1 score of 0.8865 on the PhysioNet dataset and 0.7369 on the OUHSC dataset. The experimental results signify the model's superior capability in balancing precision and recall, which is a desired attribute for ensuring reliable medical diagnoses.
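
A minimal sketch of the first step (a continuous wavelet transform turning a 1-D signal into a time-frequency matrix), assuming the PyWavelets package; the synthetic signal, scales, and wavelet choice are placeholders, and the multi-branching ResNet itself is omitted.

```python
# Sketch: continuous wavelet transform of a 1-D signal into a time-frequency matrix,
# the input representation fed to the CNN. Signal, scales, and wavelet are placeholders.
import numpy as np
import pywt

fs = 300                                              # sampling rate in Hz (placeholder)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

print(coeffs.shape)                                   # (n_scales, n_samples): a 2-D "image" per ECG
```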

17.
Clin Exp Med ; 24(1): 73, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38598013

ABSTRACT

BACKGROUND: Personalized medicine offers targeted therapy options for cancer treatment. However, the decision whether to include a patient in next-generation sequencing (NGS) testing is not standardized. This may result in some patients receiving unnecessary testing while others who could benefit from it are not tested. Typically, patients who have exhausted conventional treatment options are of interest for consideration for molecularly targeted therapy. To assist clinicians in decision-making, we developed a decision support tool using routine data from a precision oncology program. METHODS: We trained a machine learning model on clinical data to determine whether molecular profiling should be performed for a patient. To validate the model, the model's predictions were compared with decisions made by a molecular tumor board (MTB) using multiple patient case vignettes and their characteristics. RESULTS: The prediction model included 440 patients with molecular profiling and 13,587 patients without testing. High area under the curve (AUC) scores indicated the importance of the engineered features in deciding on molecular profiling. Patient age, physical condition, tumor type, metastases, and previous therapies were the most important features. During validation, the MTB experts reproduced their own earlier decisions on recommending molecular profiling in only 10 out of 15 cases, whereas the experts and the model agreed in 9 out of 15 cases. CONCLUSION: Based on a historical cohort, our predictive model has the potential to assist clinicians in deciding whether to perform molecular profiling.


Subject(s)
Neoplasms , Humans , Neoplasms/diagnosis , Neoplasms/genetics , Routinely Collected Health Data , Precision Medicine , Machine Learning , Molecular Targeted Therapy
18.
Heliyon ; 10(5): e26977, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38463780

ABSTRACT

Identification of self-care problems in children is a challenging task for medical professionals owing to its complexity and time consumption. Furthermore, the shortage of occupational therapists worldwide makes the task more challenging. Machine learning methods have come to the aid of reducing the complexity associated with problems in diverse fields. This paper employs machine-learning-based models to identify whether a child suffers from self-care problems using the SCADI dataset. The dataset exhibits high dimensionality and class imbalance. Initially, the dataset was reduced to a lower dimensionality. An imbalanced dataset is likely to affect the performance of machine learning models; to address this issue, the SMOTE oversampling method was used to reduce the wide variations in the class distribution. The classification methods used were naïve Bayes, J48, and random forest. The random forest classifier, operating on the SMOTE-balanced data, obtained the best classification performance, with a balanced accuracy of 99%. The classification model outperformed existing expert systems.

19.
Accid Anal Prev ; 199: 107526, 2024 May.
Article in English | MEDLINE | ID: mdl-38432064

ABSTRACT

Drivers who perform frequent high-risk events (e.g., hard braking maneuvers) pose a significant threat to traffic safety. Existing studies commonly estimated high-risk event occurrence probabilities based on the assumption that data collected from different time periods are independent and identically distributed (the i.i.d. assumption). Such an approach ignores the issue of driving-behavior temporal covariate shift, where the distributions of driving behavior factors vary over time. To fill this gap, this study aims to obtain time-invariant driving behavior features and establish their relationships with high-risk event occurrence probability. Specifically, a generalized modeling framework consisting of distribution characterization (DC) and distribution matching (DM) modules was proposed. The DC module split the whole dataset into several segments with the largest distribution gaps, while the DM module identified time-invariant driving behavior features through learning common knowledge among different segments. Then, a gated recurrent unit (GRU) was employed to mine time-invariant driving behavior features for high-risk event occurrence probability estimation. Moreover, modified loss functions were introduced for learning from imbalanced data caused by the rarity of high-risk events. The empirical analyses were conducted using online ride-hailing services data. Experiment results showed that the proposed generalized modeling framework provided a 7.2% higher average precision compared to the traditional i.i.d.-assumption-based approach. The modified loss functions further improved the model performance by 3.8%. Finally, benefits for driver management program improvement were explored through a case study, demonstrating a 33.34% enhancement in the precision of identifying high-risk-event-prone drivers.
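
The modified loss functions are not specified in the abstract; as one hedged illustration, the sketch below trains a GRU classifier with a class-weighted binary cross-entropy (PyTorch assumed), up-weighting the rare positive class. The data and weighting scheme are assumptions, not the paper's formulation.

```python
# Sketch: a GRU-based binary classifier trained with a class-weighted BCE loss, one common
# way to modify the loss for rare events. The weighting scheme and data are assumptions.
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):                      # x: (batch, time, features)
        _, h = self.gru(x)
        return self.head(h[-1]).squeeze(-1)    # one logit per sequence

# Placeholder driving-behavior sequences: 256 trips, 50 time steps, 8 features, ~5% positives.
x = torch.randn(256, 50, 8)
y = (torch.rand(256) < 0.05).float()

pos_weight = (y == 0).sum() / (y == 1).sum().clamp(min=1)   # up-weight the rare positive class
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
model = GRUClassifier(n_features=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5):                             # a few illustrative training steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print("final loss:", float(loss))
```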


Subject(s)
Accidents, Traffic , Knowledge , Humans , Accidents, Traffic/prevention & control , Learning , Probability
20.
Math Biosci Eng ; 21(3): 4309-4327, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38549329

ABSTRACT

Due to their high bias in favor of the majority class, traditional machine learning classifiers face a great challenge when there is a class imbalance in biological data. More recently, generative adversarial networks (GANs) have been applied to imbalanced data classification. For standard GANs, the distribution of the minority-class data fed into the discriminator is unknown, and the input to the generator is random noise $z$ drawn from a standard normal distribution $N(0, 1)$. This inevitably increases the training difficulty of the network and reduces the quality of the generated data. To solve this problem, we proposed a new oversampling algorithm that combines the bootstrap method and the Wasserstein GAN network (BM-WGAN). In our approach, the input to the generator network is data $z$ drawn from the distribution of the minority class estimated by the bootstrap method (BM). Once network training is completed, the generator is used to synthesize minority-class data. Through the above steps, the generator model can learn useful features from the minority class and generate realistic-looking minority-class samples. The experimental results indicate that BM-WGAN greatly improves classification performance compared to other oversampling algorithms. The BM-WGAN implementation is available at: https://github.com/ithbjgit1/BMWGAN.git.
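
The WGAN training loop is too long for a short example, but the bootstrap (BM) half of the idea, drawing generator inputs from a bootstrap estimate of the minority-class distribution instead of $N(0, 1)$, can be sketched as follows; data, dimensions, and the jitter scale are placeholders.

```python
# Sketch: the bootstrap (BM) half of BM-WGAN: draw generator inputs from a bootstrap
# resample of the minority class (with optional small jitter) instead of N(0, 1).
# Data, dimensions, and the noise scale are illustrative; the WGAN training loop is omitted.
import numpy as np

rng = np.random.default_rng(0)
X_minority = rng.normal(loc=2.0, scale=0.5, size=(30, 8))   # few minority samples (placeholder)

def bootstrap_latent(X_min, batch_size, noise_scale=0.05, rng=rng):
    """Bootstrap-resample minority rows and jitter them to form the generator's input batch."""
    idx = rng.integers(0, len(X_min), size=batch_size)       # sampling with replacement
    return X_min[idx] + rng.normal(scale=noise_scale, size=(batch_size, X_min.shape[1]))

z = bootstrap_latent(X_minority, batch_size=64)
print(z.shape)            # (64, 8): a batch of inputs whose distribution follows the minority class
```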
