Results 1 - 20 of 17,918
1.
Elife ; 13: 2024 Sep 06.
Article in English | MEDLINE | ID: mdl-39240757

ABSTRACT

Theoretical computational models are widely used to describe latent cognitive processes. However, these models do not explain data equally well across participants, with some individuals showing a larger predictive gap than others. In the current study, we examined the use of theory-independent models, specifically recurrent neural networks (RNNs), to classify the source of a predictive gap in the observed data of a single individual. This approach aims to identify whether the low predictability of behavioral data is mainly due to noisy decision-making or to misspecification of the theoretical model. First, we used computer simulation in the context of reinforcement learning to demonstrate that RNNs can be used to identify model misspecification in simulated agents with varying degrees of behavioral noise. Specifically, both prediction performance and the number of RNN training epochs (i.e., the point of early stopping) can be used to estimate the amount of stochasticity in the data. Second, we applied our approach to an empirical dataset where the actions of low-IQ participants, compared with high-IQ participants, were less predictable by a well-known theoretical model (i.e., Daw's hybrid model for the two-step task). Both the predictive gap and the point of early stopping of the RNN suggested that model misspecification is similar across individuals. This led us to the provisional conclusion that low-IQ participants are mostly noisier than their high-IQ peers, rather than being more poorly captured by the theoretical model. We discuss the implications and limitations of this approach, considering the growing literature on both theoretical and data-driven computational modeling in decision-making science.
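
The core mechanic lends itself to a compact illustration. Below is a minimal sketch, assuming PyTorch, with an illustrative Q-learning simulant and arbitrary network sizes and thresholds (none of this is the paper's code): a small GRU learns to predict the next choice from past choices and rewards, and the early-stopping epoch and validation loss together serve as a probe of decision noise.

```python
# Minimal sketch (not the paper's code): use a GRU's early-stopping epoch and
# predictive accuracy as joint probes of decision noise in simulated bandit data.
import torch
import torch.nn as nn

def simulate_agent(n_trials=2000, alpha=0.3, beta=2.0, seed=0):
    """Q-learning agent on a two-armed bandit; lower beta = noisier choices."""
    g = torch.Generator().manual_seed(seed)
    q = torch.zeros(2)
    p_reward = torch.tensor([0.8, 0.2])
    choices, rewards = [], []
    for _ in range(n_trials):
        p = torch.softmax(beta * q, dim=0)
        c = int(torch.multinomial(p, 1, generator=g))
        r = float(torch.rand(1, generator=g) < p_reward[c])
        q[c] += alpha * (r - q[c])                 # delta-rule update
        choices.append(c); rewards.append(r)
    return torch.tensor(choices), torch.tensor(rewards)

class ChoiceRNN(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.gru = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)
    def forward(self, x):                          # x: (batch, time, [choice, reward])
        h, _ = self.gru(x)
        return self.head(h)                        # logits for the next choice

choices, rewards = simulate_agent(beta=2.0)        # a fairly noisy agent
x = torch.stack([choices.float(), rewards], dim=-1)[None, :-1]
y = choices[1:][None]
n_train = 1500
model = ChoiceRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
best, patience, stop_epoch = float("inf"), 0, 0
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x[:, :n_train]).flatten(0, 1), y[:, :n_train].flatten())
    loss.backward(); opt.step()
    with torch.no_grad():
        val = loss_fn(model(x)[:, n_train:].flatten(0, 1), y[:, n_train:].flatten())
    if val < best - 1e-4:
        best, patience, stop_epoch = float(val), 0, epoch
    else:
        patience += 1
        if patience >= 10:                         # noisier agents tend to stop earlier,
            break                                  # at a higher validation loss
print(f"early stop at epoch {stop_epoch}, val loss {best:.3f}")
```

On this account, rerunning with a larger beta (a less noisy agent) should yield a later stopping point and a lower validation loss.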


Subject(s)
Choice Behavior; Neural Networks, Computer; Humans; Choice Behavior/physiology; Computer Simulation; Stochastic Processes; Reinforcement, Psychology; Male; Female; Decision Making/physiology; Adult; Young Adult
2.
Sci Adv ; 10(36): eadi7137, 2024 Sep 06.
Article in English | MEDLINE | ID: mdl-39241065

ABSTRACT

Contemporary theories guiding the search for neural mechanisms of learning and memory assume that associative learning results from the temporal pairing of cues and reinforcers resulting in coincident activation of associated neurons, strengthening their synaptic connection. While enduring, this framework has limitations: Temporal pairing-based models of learning do not fit with many experimental observations and cannot be used to make quantitative predictions about behavior. Here, we present behavioral data that support an alternative, information-theoretic conception: The amount of information that cues provide about the timing of reward delivery predicts behavior. Furthermore, this approach accounts for the rate and depth of both inhibitory and excitatory learning across paradigms and species. We also show that dopamine release in the ventral striatum reflects cue-predicted changes in reinforcement rates consistent with subjects understanding temporal relationships between task events. Our results reshape the conceptual and biological framework for understanding associative learning.
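
One widely cited formalization of this account (Balsam and Gallistel's rate-ratio analysis, assumed here rather than taken from the abstract) quantifies a cue's informativeness as the ratio of the background interval between rewards, C, to the cue-reward interval, T, with the information conveyed equal to log2(C/T). A toy calculation under that assumption:

```python
# Toy calculation of cue informativeness, assuming the rate-ratio formalization
# (informativeness = C/T, information conveyed = log2(C/T)); values illustrative.
import math

def cue_informativeness(cycle_time_c, cue_to_reward_t):
    """C = mean interval between rewards in the context,
    T = mean cue-to-reward interval; both in seconds."""
    ratio = cycle_time_c / cue_to_reward_t
    return ratio, math.log2(ratio)

ratio, bits = cue_informativeness(cycle_time_c=240.0, cue_to_reward_t=10.0)
print(f"rate ratio C/T = {ratio:.1f}, information = {bits:.2f} bits")
# On this reading, the same cue is learned faster when rewards are otherwise
# rare (large C) or when the cue closely precedes reward (small T).
```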


Subject(s)
Cues; Dopamine; Learning; Dopamine/metabolism; Animals; Learning/physiology; Male; Reward; Association Learning/physiology; Rats; Humans; Reinforcement, Psychology
3.
Learn Mem ; 31(8): 2024 Aug.
Article in English | MEDLINE | ID: mdl-39284619

ABSTRACT

"Pavlovian" or "motivational" biases are the phenomenon that the valence of prospective outcomes modulates action invigoration: the prospect of reward invigorates actions, while the prospect of punishment suppresses actions. Effects of the valence of prospective outcomes are well established, but it remains unclear how the magnitude of outcomes ("stake magnitude") modulates these biases. In this preregistered study (N = 55), we manipulated stake magnitude (high vs. low) in an orthogonalized Motivational Go/NoGo Task. We tested whether higher stakes (a) strengthen biases or (b) elicit cognitive control recruitment, enhancing the suppression of biases in motivationally incongruent conditions. Confirmatory tests showed that high stakes slowed down responding, especially in motivationally incongruent conditions. However, high stakes did not affect whether a response was made or not, and did not change the magnitude of Pavlovian biases. Reinforcement-learning drift-diffusion models (RL-DDMs) fit to the data suggested that response slowing was best captured by stakes prolonging the non-decision time. There was no effect of the stakes on the response threshold (as in typical speed-accuracy trade-offs). In sum, these results suggest that high stakes slow down responses without affecting the expression of Pavlovian biases in behavior. We speculate that this slowing under high stakes might reflect heightened cognitive control, which is however ineffectively used, or reflect positive conditioned suppression, i.e., the interference between goal-directed and consummatory behaviors, a phenomenon previously observed in rodents that might also exist in humans. Pavlovian biases and slowing under high stakes may arise in parallel to each other.


Subject(s)
Conditioning, Classical; Motivation; Reward; Humans; Male; Motivation/physiology; Young Adult; Female; Conditioning, Classical/physiology; Adult; Reaction Time/physiology; Adolescent; Punishment; Reinforcement, Psychology; Psychomotor Performance/physiology
4.
PLoS Comput Biol ; 20(9): e1012404, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39231162

ABSTRACT

Humans tend to give more weight to information confirming their beliefs than to information that disconfirms them. Nevertheless, this apparent irrationality has been shown to improve individual decision-making under uncertainty. However, little is known about this bias's impact on decision-making in a social context. Here, we investigate the conditions under which confirmation bias is beneficial or detrimental to decision-making under social influence. To do so, we develop a Collective Asymmetric Reinforcement Learning (CARL) model in which artificial agents observe others' actions and rewards, and update this information asymmetrically. We use agent-based simulations to study how confirmation bias affects collective performance on a two-armed bandit task, and how resource scarcity, group size and bias strength modulate this effect. We find that confirmation bias benefits group learning across a wide range of resource-scarcity conditions. Moreover, we discover that, past a critical bias strength, resource abundance favors the emergence of two different performance regimes, one of which is suboptimal. In addition, we find that this regime bifurcation comes with polarization in small groups of agents. Overall, our results suggest the existence of an optimal, moderate level of confirmation bias for decision-making in a social context.
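
A minimal sketch of a confirmation-biased update in this spirit, under one simplified reading (CARL's exact treatment of own versus observed outcomes is not reproduced here): prediction errors that confirm the agent's current preference receive the larger learning rate, and the same asymmetric rule is applied to outcomes observed from another agent.

```python
# Minimal sketch of a confirmation-biased update in a social two-armed bandit.
# One simplified reading, not the CARL specification.
import random

def biased_update(q, action, reward, eta_conf=0.30, eta_disc=0.10):
    """Confirmation-biased delta rule: news that confirms the current preference
    (good outcomes for the preferred arm, bad ones for the other) gets the larger rate."""
    pe = reward - q[action]
    preferred = action == max(range(len(q)), key=lambda a: q[a])
    confirming = (pe >= 0) == preferred
    q[action] += (eta_conf if confirming else eta_disc) * pe

random.seed(0)
q, p_reward = [0.5, 0.5], [0.7, 0.4]
for _ in range(500):
    my_action = max(range(2), key=lambda a: q[a] + random.gauss(0, 0.1))  # noisy greedy
    other_action = random.randrange(2)       # stand-in for an observed partner's choice
    biased_update(q, my_action, float(random.random() < p_reward[my_action]))
    biased_update(q, other_action, float(random.random() < p_reward[other_action]))
print([round(v, 2) for v in q])
```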


Subject(s)
Decision Making; Reinforcement, Psychology; Decision Making/physiology; Humans; Computer Simulation; Computational Biology; Reward; Bias; Learning/physiology; Models, Psychological
5.
Elife ; 13: 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39255007

ABSTRACT

Previous studies on reinforcement learning have identified three prominent phenomena: (1) individuals with anxiety or depression exhibit a reduced learning rate compared to healthy subjects; (2) learning rates may increase or decrease in environments with rapidly changing (i.e., volatile) or stable feedback conditions, a phenomenon termed learning rate adaptation; and (3) reduced learning rate adaptation is associated with several psychiatric disorders. Accounting for these behavioral differences across participant populations and volatility contexts has therefore required multiple learning rate parameters, as in the flexible learning rate (FLR) model. Here, we propose an alternative explanation, suggesting that behavioral variation across participant populations and volatile contexts arises from the use of mixed decision strategies. To test this hypothesis, we constructed a mixture-of-strategies (MOS) model and used it to analyze the behaviors of 54 healthy controls and 32 patients with anxiety and depression in volatile reversal learning tasks. Compared to the FLR model, the MOS model can reproduce the three classic phenomena by using a single set of strategy preference parameters without introducing any learning rate differences. In addition, the MOS model can successfully account for several novel behavioral patterns that cannot be explained by the FLR model. Preferences for different strategies also predict individual variations in symptom severity. These findings underscore the importance of considering mixed strategy use in human learning and decision-making and suggest atypical strategy preference as a potential mechanism for learning deficits in psychiatric disorders.
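
The mixture idea can be illustrated with three stand-in strategies; the strategy definitions and weights below are simplified placeholders, not the paper's exact specification. The point is structural: each strategy proposes action probabilities, and fixed preference weights, rather than multiple learning rates, mix them into a single policy.

```python
# Illustrative mixture-of-strategies choice rule: each strategy proposes action
# probabilities, and fixed preference weights (not learning rates) mix them.
import math

def softmax(xs, beta=5.0):
    m = max(xs)
    es = [math.exp(beta * (x - m)) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mos_policy(q_values, magnitudes, prev_action, w=(0.6, 0.2, 0.2)):
    value_based = softmax(q_values)               # follow learned value
    magnitude   = softmax(magnitudes)             # chase large outcomes
    habit       = [1.0 if a == prev_action else 0.0 for a in range(len(q_values))]
    return [w[0] * v + w[1] * m + w[2] * h
            for v, m, h in zip(value_based, magnitude, habit)]

print(mos_policy(q_values=[0.7, 0.3], magnitudes=[10.0, 50.0], prev_action=0))
```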


Subject(s)
Anxiety; Decision Making; Depression; Humans; Male; Female; Adult; Decision Making/physiology; Uncertainty; Young Adult; Reinforcement, Psychology; Models, Psychological; Reversal Learning/physiology
6.
Learn Mem ; 31(8): 2024 Aug.
Article in English | MEDLINE | ID: mdl-39260876

ABSTRACT

Safety signals reinforce instrumental avoidance behavior in nonhuman animals. However, there are no conclusive demonstrations of this phenomenon in humans. Using an avoidance task with human participants, Experiments 1-3 and 5 were conducted online to assess the reinforcing properties of safety signals, and Experiment 4 was conducted in the laboratory. Participants were trained with CSs+ and CSs-, and they could avoid an aversive outcome during presentations of the CSs+ by pressing their space bar at a specific time. If they were successful, the aversive outcome was not presented and a safety signal appeared instead. Participants were then tested, in extinction, with two new ambiguous test CSs. If participants made an avoidance response, one of the test CSs produced the trained safety signal and the other was a control. In Experiments 1 and 4, the control was followed by no signal. In Experiment 2, the control was followed by a signal that differed from the trained safety signal in one dimension (color), and in Experiment 3, the control differed from the trained safety signal in two dimensions (shape and color). Experiment 5 tested the reinforcing properties of the safety signal using a choice procedure and a new response during the test. We observed that participants made more avoidance responses to the ambiguous test CSs when they were followed by the trained signal in Experiments 1, 3, 4, and 5 (but not in Experiment 2). Overall, these results suggest that trained safety signals can reinforce avoidance behavior in humans.


Subject(s)
Avoidance Learning; Conditioning, Operant; Reinforcement, Psychology; Humans; Avoidance Learning/physiology; Male; Female; Young Adult; Adult; Conditioning, Operant/physiology; Extinction, Psychological/physiology; Adolescent
7.
Artif Intell Med ; 156: 102945, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39178622

ABSTRACT

In the formulation of strategies for walking rehabilitation, achieving precise identification of the current state and making rational predictions about the future state are crucial but often unrealized. To tackle this challenge, our study introduces a unified framework that integrates a novel 3D walking motion capture method using multi-source image fusion and a walking rehabilitation simulation approach based on multi-agent reinforcement learning. We found that (i) the proposed method achieves accurate 3D walking motion capture and outperforms other advanced methods: compared to similar visual skeleton tracking methods, it yields results with a higher Pearson correlation (r=0.93), a higher intra-class correlation coefficient (ICC(2,1)=0.91), and narrower confidence intervals ([0.90,0.95] for r, [0.88,0.94] for ICC(2,1)) against reference-standard results. Its outcomes also show commendable correlation and agreement with those obtained through the IMU-based skeleton tracking method in the assessment of gait parameters ([0.85,0.89] for r, [0.75,0.81] for ICC(2,1)); and (ii) multi-agent reinforcement learning can be used to solve the simulation task of gait rehabilitation. In the mimicry experiment, our proposed simulation method for gait rehabilitation not only enables the intelligent agent to converge from the initial state to the target state, but also exhibits evolutionary patterns similar to those observed in clinical practice, as revealed by motor state resolution. This study offers valuable contributions to walking rehabilitation, enabling precise assessment and simulation-based interventions, with potential implications for clinical practice and patient outcomes.


Subject(s)
Gait; Walking; Humans; Walking/physiology; Gait/physiology; Computer Simulation; Reinforcement, Psychology; Imaging, Three-Dimensional/methods; Machine Learning
8.
Nat Commun ; 15(1): 7590, 2024 Aug 31.
Article in English | MEDLINE | ID: mdl-39217160

ABSTRACT

Neural systems have evolved not only to solve environmental challenges through internal representations but also, under social constraints, to communicate these to conspecifics. In this work, we aim to understand the structure of these internal representations and how they may be optimized to transmit pertinent information from one individual to another. Thus, we build on previous teacher-student communication protocols to analyze the formation of individual and shared abstractions and their impact on task performance. We use reinforcement learning in grid-world mazes where a teacher network passes a message to a student to improve task performance. This framework allows us to relate environmental variables to individual and shared representations. We compress high-dimensional task information into a low-dimensional representational space to mimic natural language features. Consistent with previous results, we find that providing teacher information to the student leads to a higher task completion rate and an ability to generalize to tasks it has not seen before. Further, optimizing message content to maximize student reward improves information encoding, suggesting that an accurate representation in the space of messages requires bi-directional input. These results highlight the role of language as a common representation among agents and its implications for generalization capabilities.


Subject(s)
Language; Social Learning; Humans; Reinforcement, Psychology; Learning/physiology; Task Performance and Analysis
9.
Med Eng Phys ; 130: 104197, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39160025

ABSTRACT

The neural control of human quiet stance remains controversial, with classic views suggesting a limited role of the brain and recent findings conversely indicating direct cortical control of muscles during upright posture. Conceptual neural feedback control models have been proposed and tested against experimental evidence. The most renowned model is the continuous impedance control model. However, when time delays are included in this model to simulate neural transmission, the continuous controller becomes unstable. Another model, the intermittent control model, assumes that the central nervous system (CNS) activates muscles intermittently, rather than continuously, to counteract gravitational torque. In this study, a delayed reinforcement learning algorithm was developed to seek the optimal control policy for balancing a one-segment inverted pendulum model representing the human body. Under this approach, no a priori strategy was imposed on the controller; rather, the optimal strategy emerged from reward-based learning. The simulation results indicated that the optimal neural controller exhibits intermittent, not continuous, characteristics, in agreement with the possibility that the CNS intermittently provides neural feedback torque to maintain an upright posture.
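
A hedged sketch of the setup, not the paper's algorithm: tabular Q-learning balances a one-segment inverted pendulum whose state reaches the controller only after a transmission delay. All constants are illustrative.

```python
# Sketch of reward-based stance control with delayed feedback (illustrative
# constants, not the paper's method): tabular Q-learning on a one-segment
# inverted pendulum observed through a 100 ms delay line.
import math
import random
from collections import deque

random.seed(0)
DT, DELAY_STEPS = 0.01, 10                 # 10 ms steps; 100 ms feedback delay
ACTIONS = [-8.0, 0.0, 8.0]                 # corrective torque per unit inertia
G_OVER_L = 9.81                            # gravity / pendulum length (L = 1 m)

def bucket(theta, omega):                  # coarse discretization of the state
    return (max(-5, min(5, int(theta / 0.02))), max(-5, min(5, int(omega / 0.2))))

q, alpha, gamma, eps = {}, 0.1, 0.99, 0.1
def q_get(s):
    return q.setdefault(s, [0.0, 0.0, 0.0])

for episode in range(500):
    theta, omega = random.uniform(-0.05, 0.05), 0.0
    obs = deque([bucket(theta, omega)] * DELAY_STEPS, maxlen=DELAY_STEPS)
    s = obs[0]                             # the controller sees a 100 ms old state
    for step in range(300):
        a = random.randrange(3) if random.random() < eps \
            else max(range(3), key=lambda i: q_get(s)[i])
        omega += (G_OVER_L * math.sin(theta) + ACTIONS[a]) * DT   # plant dynamics
        theta += omega * DT
        obs.append(bucket(theta, omega))
        s2, done = obs[0], abs(theta) > 0.3
        r = -1.0 if done else 1.0          # reward for remaining upright
        target = r if done else r + gamma * max(q_get(s2))
        q_get(s)[a] += alpha * (target - q_get(s)[a])
        s = s2
        if done:
            break
```

Inspecting the greedy policy after training, for example how often the zero-torque action is selected near upright, is one way to look for the intermittent on-off pattern the paper reports.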


Subject(s)
Posture; Humans; Posture/physiology; Reinforcement, Psychology; Learning; Time Factors; Models, Biological; Torque
10.
Sci Rep ; 14(1): 19759, 2024 Aug 26.
Article in English | MEDLINE | ID: mdl-39187552

ABSTRACT

Reinforcement learning (RL) is an effective method for training dialogue policies to steer the conversation towards successful task completion. However, most RL-based methods rely only on semantic inputs and lack empathy, as they ignore user emotional information. Moreover, they suffer from delayed rewards, because the user simulator returns valuable results only at the end of the dialogue. Recently, some methods have been proposed to learn the reward function together with user emotions, but they do not consider user emotion at each dialogue turn. In this paper, we propose an emotion-sensitive dialogue policy model (ESDP) that incorporates user emotion information into the dialogue policy and selects the optimal action by combining the top-k actions with the user's emotions. The user emotion information in each turn is used as an immediate reward for the current dialogue state, mitigating sparse rewards and the dependency on episode termination. Extensive experiments validate that our method outperforms the baseline approaches when combined with different Q-learning algorithms and also surpasses the performance of other popular existing dialogue policies.
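
The two mechanisms the abstract names can be sketched as follows; action names, scoring, and weights are illustrative, not the ESDP implementation. A per-turn emotion score acts as an immediate reward, and the final action is chosen by re-ranking the top-k Q-valued actions by emotional compatibility.

```python
# Sketch of (1) emotion as a per-turn immediate reward and (2) top-k action
# re-ranking by emotion; names and weights are illustrative, not ESDP's.
import heapq

def select_action(q_values, emotion_scores, k=3, w_emotion=0.5):
    """q_values, emotion_scores: dicts mapping action -> float."""
    top_k = heapq.nlargest(k, q_values, key=q_values.get)
    return max(top_k, key=lambda a: q_values[a] + w_emotion * emotion_scores.get(a, 0.0))

def shaped_reward(env_reward, user_emotion_valence, w=0.3):
    """Dense signal each turn: environment reward plus the user's emotion valence
    (e.g., -1 frustrated .. +1 satisfied), easing sparse end-of-dialogue rewards."""
    return env_reward + w * user_emotion_valence

q = {"ask_slot": 0.9, "confirm": 0.85, "offer": 0.8, "bye": 0.1}
emo = {"confirm": 0.4, "ask_slot": -0.2}
print(select_action(q, emo), shaped_reward(0.0, user_emotion_valence=-0.5))
```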


Subject(s)
Emotions; Humans; Emotions/physiology; Algorithms; Reinforcement, Psychology; Reward
11.
Nat Commun ; 15(1): 7093, 2024 Aug 17.
Article in English | MEDLINE | ID: mdl-39154025

ABSTRACT

Perceptual decisions should depend on sensory evidence. However, such decisions are also influenced by past choices and outcomes. These choice history biases may reflect advantageous strategies to exploit temporal regularities of natural environments. However, it is unclear whether and how observers can adapt their choice history biases to different temporal regularities, to exploit the multitude of temporal correlations that exist in nature. Here, we show that male mice adapt their perceptual choice history biases to different temporal regularities of visual stimuli. This adaptation was slow, evolving over hundreds of trials across several days. It occurred alongside a fast non-adaptive choice history bias, limited to a few trials. Both fast and slow trial history effects are well captured by a normative reinforcement learning algorithm with multi-trial belief states, comprising both current trial sensory and previous trial memory states. We demonstrate that dorsal striatal dopamine tracks predictions of the model and behavior, suggesting that striatal dopamine reports reward predictions associated with adaptive choice history biases. Our results reveal the adaptive nature of perceptual choice history biases and shed light on their underlying computational principles and neural correlates.


Subject(s)
Choice Behavior; Corpus Striatum; Dopamine; Animals; Male; Dopamine/metabolism; Mice; Corpus Striatum/metabolism; Corpus Striatum/physiology; Choice Behavior/physiology; Mice, Inbred C57BL; Decision Making/physiology; Reward; Photic Stimulation; Visual Perception/physiology; Reinforcement, Psychology
12.
Nat Commun ; 15(1): 6617, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39122687

ABSTRACT

The role of serotonin in human behaviour is informed by approaches which allow in vivo modification of synaptic serotonin. However, characterising the effects of increased serotonin signalling in human models of behaviour is challenging given the limitations of available experimental probes, notably selective serotonin reuptake inhibitors. Here we use a now-accessible approach to directly increase synaptic serotonin in humans (a selective serotonin releasing agent) and examine its influence on domains of behaviour historically considered core functions of serotonin. Computational techniques, including reinforcement learning and drift diffusion modelling, explain participant behaviour at baseline and after week-long intervention. Reinforcement learning models reveal that increasing synaptic serotonin reduces sensitivity for outcomes in aversive contexts. Furthermore, increasing synaptic serotonin enhances behavioural inhibition, and shifts bias towards impulse control during exposure to aversive emotional probes. These effects are seen in the context of overall improvements in memory for neutral verbal information. Our findings highlight the direct effects of increasing synaptic serotonin on human behaviour, underlining its role in guiding decision-making within aversive and more neutral contexts, and offering implications for longstanding theories of central serotonin function.


Subject(s)
Serotonin; Humans; Serotonin/metabolism; Male; Female; Adult; Young Adult; Reinforcement, Psychology; Avoidance Learning/drug effects; Avoidance Learning/physiology; Emotions/physiology; Inhibition, Psychological; Selective Serotonin Reuptake Inhibitors/pharmacology; Learning/physiology; Decision Making/physiology; Memory/physiology; Memory/drug effects
13.
Proc Biol Sci ; 291(2028): 20241141, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39110908

ABSTRACT

Learning is a taxonomically widespread process by which animals change their behavioural responses to stimuli as a result of experience. In this way, it plays a crucial role in the development of individual behaviour and underpins substantial phenotypic variation within populations. Nevertheless, the impact of learning in social contexts on evolutionary change is not well understood. Here, we develop game theoretical models of competition for resources in small groups (e.g. producer-scrounger and hawk-dove games) in which actions are controlled by reinforcement learning and show that biases in the subjective valuation of different actions readily evolve. Moreover, in many cases, the convergence stable levels of bias exist at fitness minima and therefore lead to disruptive selection on learning rules and, potentially, to the evolution of genetic polymorphisms. Thus, we show how reinforcement learning in social contexts can be a driver of evolutionary diversification. In addition, we consider the evolution of ability in our games, showing that learning can also drive disruptive selection on the ability to perform a task.
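
A toy instance of the mechanism, with illustrative payoffs and learning rule: a Roth-Erev-style reinforcement learner plays hawk-dove against a random field, and a heritable bias term inflates the subjective value of playing hawk.

```python
# Toy hawk-dove game with an RL-controlled player whose subjective valuation of
# "hawk" carries a bias b; payoffs and learning rule are illustrative.
import random

random.seed(2)
V, C = 4.0, 6.0                                   # resource value, fight cost

def payoff(me, other):                            # hawk-dove stage game
    if me == "H":
        return (V - C) / 2 if other == "H" else V
    return 0.0 if other == "H" else V / 2

def play(n_rounds=5000, bias=1.0, eta=0.1):
    """Reinforcement learner vs. a random field; bias inflates hawk's subjective value."""
    q = {"H": 0.0, "D": 0.0}
    hawk_count = 0
    for _ in range(n_rounds):
        a = max(q, key=lambda k: q[k] + (bias if k == "H" else 0.0)
                + random.gauss(0, 0.5))           # noisy, biased action values
        q[a] += eta * (payoff(a, random.choice("HD")) - q[a])
        hawk_count += a == "H"
    return hawk_count / n_rounds

print([round(play(bias=b), 2) for b in (0.0, 1.0, 2.0)])
```

Sweeping the bias shows how subjective valuation shifts the learned hawk frequency away from the unbiased equilibrium, the quantity on which selection would act.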


Subject(s)
Biological Evolution; Competitive Behavior; Game Theory; Learning; Animals; Reinforcement, Psychology
14.
Cereb Cortex ; 34(8): 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39118215

ABSTRACT

Freedom of choice enhances our sense of agency. During goal-directed behavior, the freedom to choose between different response options increases the neural processing of positive and negative feedback, indicating enhanced outcome monitoring under conditions of high agency experience. However, it is unclear whether this enhancement is predominantly driven by an increased salience of self- compared to externally determined action outcomes or whether differences in the perceived instrumental value of outcomes contribute to outcome monitoring in goal-directed tasks. To test this, we recorded electroencephalography while participants performed a reinforcement learning task involving free choices, action-relevant forced choices, and action-irrelevant forced choices. We observed larger midfrontal theta power and N100 amplitudes for feedback following free choices compared with action-relevant and action-irrelevant forced choices. In addition, a Reward Positivity was only present for free but not forced choice outcomes. Crucially, our results indicate that enhanced outcome processing is not driven by the relevance of outcomes for future actions but rather stems from the association of outcomes with recent self-determined choice. Our findings highlight the pivotal role of self-determination in tracking the consequences of our actions and contribute to an understanding of the cognitive processes underlying the choice-induced facilitation in outcome monitoring.


Subject(s)
Choice Behavior; Electroencephalography; Personal Autonomy; Humans; Male; Female; Choice Behavior/physiology; Young Adult; Adult; Reward; Evoked Potentials/physiology; Brain/physiology; Learning/physiology; Reinforcement, Psychology; Theta Rhythm/physiology
15.
Addict Biol ; 29(8): e13429, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39109814

ABSTRACT

The endocannabinoid system interacts with the reward system to modulate responsiveness to natural reinforcers, as well as drugs of abuse. Previous preclinical studies suggested that direct blockade of CB1 cannabinoid receptors (CB1R) could be leveraged as a potential pharmacological approach to treat substance use disorder, but this strategy failed during clinical trials due to severe psychiatric side effects. Alternative strategies have emerged to circumvent the side effects of direct CB1 binding through the development of allosteric modulators. We hypothesized that negative allosteric modulation of CB1R signalling would reduce the reinforcing properties of morphine and decrease behaviours associated with opioid misuse. By employing intravenous self-administration in mice, we studied the effects of GAT358, a functionally biased CB1R negative allosteric modulator (NAM), on morphine intake, relapse-like behaviour and motivation to work for morphine infusions. GAT358 reduced morphine infusion intake during the maintenance phase of morphine self-administration under a fixed ratio 1 schedule of reinforcement. GAT358 also decreased morphine-seeking behaviour after forced abstinence. Moreover, GAT358 dose-dependently decreased the motivation to obtain morphine infusions under a progressive ratio schedule of reinforcement. Strikingly, GAT358 did not affect the motivation to work for food rewards in an identical progressive ratio task, suggesting that the effect of GAT358 in decreasing opioid self-administration was reward specific. Furthermore, GAT358 did not produce motor ataxia in the rotarod test. Our results suggest that CB1R NAMs reduce the reinforcing properties of morphine and could represent a viable therapeutic route to safely decrease misuse of opioids.


Subject(s)
Morphine; Receptor, Cannabinoid, CB1; Self Administration; Animals; Morphine/pharmacology; Morphine/administration & dosage; Receptor, Cannabinoid, CB1/drug effects; Mice; Allosteric Regulation/drug effects; Male; Drug-Seeking Behavior/drug effects; Recurrence; Reinforcement, Psychology; Motivation/drug effects; Analgesics, Opioid/pharmacology; Analgesics, Opioid/administration & dosage; Administration, Intravenous; Conditioning, Operant/drug effects; Signal Transduction/drug effects
16.
Neural Comput ; 36(9): 1854-1885, 2024 Aug 19.
Article in English | MEDLINE | ID: mdl-39106455

ABSTRACT

In reinforcement learning (RL), artificial agents are trained to maximize numerical rewards by performing tasks. Exploration is essential in RL because agents must discover information before exploiting it. Two rewards that encourage efficient exploration are the entropy of the action policy and curiosity for information gain. Entropy is well established in the literature, promoting randomized action selection. Curiosity is defined in a broad variety of ways in the literature, promoting discovery of novel experiences. One example, prediction error curiosity, rewards agents for discovering observations they cannot accurately predict. However, such agents may be distracted by unpredictable observational noise, known as curiosity traps. Based on the free energy principle (FEP), this letter proposes hidden state curiosity, which rewards agents by the KL divergence between the predictive prior and posterior probabilities of latent variables. We trained six types of agents to navigate mazes: baseline agents without rewards for entropy or curiosity, and agents rewarded for entropy and/or either prediction error curiosity or hidden state curiosity. We find that entropy and curiosity result in efficient exploration, especially when both are employed together. Notably, agents with hidden state curiosity demonstrate resilience against the curiosity traps that hinder agents with prediction error curiosity. This suggests that implementing the FEP may enhance the robustness and generalization of RL models, potentially aligning the learning processes of artificial and biological agents.
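
The hidden-state-curiosity bonus as the abstract defines it can be sketched directly; the network shapes below are illustrative. The intrinsic reward is the KL divergence between a predictive prior over the latent state (computed from memory alone) and the posterior (memory plus the new observation).

```python
# Sketch of a hidden-state-curiosity bonus: KL divergence between the predictive
# prior p(z_t | h_{t-1}) and posterior q(z_t | h_{t-1}, o_t) over a latent state.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class LatentHeads(nn.Module):
    def __init__(self, h_dim=32, o_dim=16, z_dim=8):
        super().__init__()
        self.prior = nn.Linear(h_dim, 2 * z_dim)          # from memory alone
        self.post = nn.Linear(h_dim + o_dim, 2 * z_dim)   # memory + new observation
    def dists(self, h, o):
        pm, pl = self.prior(h).chunk(2, dim=-1)
        qm, ql = self.post(torch.cat([h, o], dim=-1)).chunk(2, dim=-1)
        return Normal(pm, pl.exp()), Normal(qm, ql.exp())

heads = LatentHeads()
h, o = torch.randn(4, 32), torch.randn(4, 16)
prior, post = heads.dists(h, o)
# Intrinsic reward: how much the observation moved beliefs about the hidden state.
# Pure observational noise inflates the prior's variance over training, so the KL,
# unlike raw prediction error, need not keep rewarding it (the anti-trap argument).
r_int = kl_divergence(post, prior).sum(dim=-1)
print(r_int.shape)  # one curiosity bonus per batch element
```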


Subject(s)
Exploratory Behavior; Reinforcement, Psychology; Reward; Exploratory Behavior/physiology; Humans; Entropy; Computer Simulation
17.
PLoS One ; 19(8): e0307211, 2024.
Article in English | MEDLINE | ID: mdl-39172969

ABSTRACT

Modern optical systems are important components of contemporary electronics and communication technologies, and the design of new systems has led to many innovative breakthroughs. This paper introduces a novel application of deep reinforcement learning, D3QN, a combination of the Dueling Architecture and Double Q-Network methods, to design distributed Bragg reflectors (DBRs). Traditional design methods rely on time-consuming iterative simulations, whereas D3QN is designed to optimize the multilayer structure of DBRs, improving both the reflectance performance and the compactness of the resulting designs. The reflectance of the DBRs designed using D3QN is 20.5% higher than that of designs derived from the transfer matrix method (TMM), and the resulting DBRs are 61.2% smaller. These advancements suggest that deep reinforcement learning, specifically the D3QN methodology, is a promising new method for optical design that is more efficient than traditional techniques. Future research possibilities include expansion to 2D and 3D design structures, where increased design complexity could likely be addressed using D3QN or similar innovative solutions.
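
The two ingredients that give D3QN its name can be sketched as follows, with illustrative sizes (the DBR-specific action space, which would edit layer thicknesses, is omitted): a dueling head that splits state value from action advantages, and a Double-DQN target in which the online network selects the action and the target network evaluates it.

```python
# Sketch of D3QN's two ingredients: a dueling head and a Double-DQN target.
# Sizes are illustrative; the DBR design environment is not modeled here.
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim=20, n_actions=6, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)          # state value V(s)
        self.adv = nn.Linear(hidden, n_actions)    # advantages A(s, a)
    def forward(self, x):
        h = self.body(x)
        a = self.adv(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)

def double_dqn_target(online, target, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # Online net picks the action, target net evaluates it (less overestimation).
        a_star = online(s_next).argmax(dim=-1, keepdim=True)
        q_next = target(s_next).gather(-1, a_star).squeeze(-1)
    return r + gamma * (1.0 - done) * q_next

online, target = DuelingQNet(), DuelingQNet()
target.load_state_dict(online.state_dict())
s_next = torch.randn(8, 20)
y = double_dqn_target(online, target, torch.zeros(8), s_next, torch.zeros(8))
print(y.shape)
```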


Subject(s)
Equipment Design; Deep Learning; Reinforcement, Psychology
18.
Neural Netw ; 179: 106552, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39089154

ABSTRACT

Multi-agent reinforcement learning (MARL) effectively improves the learning speed of agents in sparse-reward tasks with the guidance of subgoals. However, existing works sever the consistency of the learning objectives between the subgoal-generation and subgoal-reaching stages, thereby significantly inhibiting the effectiveness of subgoal learning. To address this problem, we propose a novel Potential field Subgoal-based Multi-Agent reinforcement learning (PSMA) method, which introduces the potential field (PF) to unify the two-stage learning objectives. Specifically, we design a state-to-PF representation model that describes agents' states as potential fields, allowing easy measurement of the interaction effect for both allied and enemy agents. With the PF representation, a subgoal selector is designed to automatically generate multiple subgoals for each agent, drawn from the experience replay buffer that contains both individual and total PF values. Based on the determined subgoals, we define an intrinsic reward function to guide each agent to reach its respective subgoal while maximizing the joint action-value. Experimental results show that our method outperforms state-of-the-art MARL methods on both StarCraft II micro-management (SMAC) and Google Research Football (GRF) tasks with sparse reward settings.
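
A heavily simplified sketch in the spirit of the abstract, not the PSMA implementation; all functional forms here are assumed. Each agent's state maps to a potential-field value with attractive ally and repulsive enemy terms, a subgoal is a desired PF value, and the intrinsic reward pays for progress toward it.

```python
# Simplified potential-field subgoal reward; every functional form is assumed,
# not taken from the paper.
def potential_field(agent_pos, allies, enemies, k_ally=0.5, k_enemy=1.0):
    """Attractive term from allies, repulsive from enemies (inverse distance)."""
    def inv_d(p, q):
        return 1.0 / (1e-6 + abs(p[0] - q[0]) + abs(p[1] - q[1]))
    return (k_ally * sum(inv_d(agent_pos, a) for a in allies)
            - k_enemy * sum(inv_d(agent_pos, e) for e in enemies))

def intrinsic_reward(pf_now, pf_prev, pf_subgoal, scale=1.0):
    """Positive when this step moved the agent's PF value toward the subgoal's."""
    return scale * (abs(pf_subgoal - pf_prev) - abs(pf_subgoal - pf_now))

pf_prev = potential_field((0, 0), allies=[(1, 1)], enemies=[(4, 4)])
pf_now = potential_field((1, 0), allies=[(1, 1)], enemies=[(4, 4)])
print(intrinsic_reward(pf_now, pf_prev, pf_subgoal=1.2))
```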


Subject(s)
Reinforcement, Psychology; Reward; Neural Networks, Computer; Humans; Algorithms; Machine Learning
19.
Neural Netw ; 179: 106543, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39089158

ABSTRACT

Recent successes in robot learning have significantly enhanced autonomous systems across a wide range of tasks. However, such systems are prone to generating similar or identical solutions, limiting the user's ability to steer the robot's behavior according to their intentions. These limited robot behaviors may lead to collisions and potential harm to humans. To resolve these limitations, we introduce a semi-autonomous teleoperation framework that enables users to operate a robot by selecting a high-level command, referred to as an option. Our approach aims to provide effective and diverse options through a learned policy, thereby enhancing the efficiency of the proposed framework. In this work, we propose a quality-diversity (QD) based sampling method that simultaneously optimizes both the quality and diversity of options using reinforcement learning (RL). Additionally, we present a mixture of latent variable models to learn multiple policy distributions defined as options. In experiments, we show that the proposed method achieves superior performance in terms of the success rate and diversity of the options in simulation environments. We further demonstrate that our method outperforms manual keyboard control in terms of completion time in cluttered real-world environments.
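
Quality-diversity selection over candidate options can be illustrated with a toy objective (not the paper's): greedily pick k options that score well on quality and lie far from already-picked options in a behavior-descriptor space.

```python
# Toy quality-diversity selection: quality plus a novelty bonus relative to the
# options already picked; objective and weights are illustrative.
def qd_select(candidates, k=3, w_div=0.5):
    """candidates: list of (quality, behavior_descriptor) with descriptors as tuples."""
    def dist(b1, b2):
        return sum((u - v) ** 2 for u, v in zip(b1, b2)) ** 0.5
    picked = []
    while candidates and len(picked) < k:
        def score(c):
            novelty = min((dist(c[1], p[1]) for p in picked), default=0.0)
            return c[0] + w_div * novelty
        best = max(candidates, key=score)
        picked.append(best)
        candidates = [c for c in candidates if c is not best]
    return picked

options = [(0.9, (0.0, 0.1)), (0.88, (0.05, 0.1)), (0.7, (0.9, 0.8)), (0.6, (0.4, 0.5))]
print(qd_select(options))  # favors one high-quality pick plus behaviorally distinct ones
```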


Subject(s)
Reinforcement, Psychology; Robotics; Robotics/methods; Humans; Machine Learning; Computer Simulation; Algorithms; Neural Networks, Computer
20.
Neural Netw ; 179: 106579, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39096749

ABSTRACT

How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning methods based on bisimulation metrics, contrast, prediction, and reconstruction have shown the ability to extract task-relevant information. However, due to the lack of appropriate mechanisms for extracting task information in the prediction-, contrast-, and reconstruction-related approaches, and the limitations of bisimulation-related methods in domains with sparse rewards, it is still difficult to extend these methods effectively to environments with distractions. To alleviate these problems, in this paper, action sequences, which contain task-intensive signals, are incorporated into representation learning. Specifically, we propose a Sequential Action-induced invariant Representation (SAR) method, which decouples the controlled part (i.e., task-relevant information) and the uncontrolled part (i.e., task-irrelevant information) of noisy observations through sequential actions, thereby extracting effective representations related to decision tasks. To achieve this, the characteristic function of the action sequence's probability distribution is modeled to specifically optimize the state encoder. We conduct extensive experiments on the distracting DeepMind Control suite, achieving the best performance against strong baselines. We also demonstrate the effectiveness of our method at disregarding task-irrelevant information by applying SAR to real-world CARLA-based autonomous driving with natural distractions. Finally, we provide analysis results on generalization, drawn from generalization decay and t-SNE visualization. Code and demo videos are available at https://github.com/DMU-XMU/SAR.git.
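
A hedged sketch of the characteristic-function idea; the shapes and training target below are illustrative, and how SAR wires this into its full objective is not reproduced. The empirical characteristic function of action sequences, evaluated at random frequencies, provides a differentiable statistic against which the state encoder can be optimized.

```python
# Sketch: the empirical characteristic function (ECF) of action sequences,
# phi(t) = E[exp(i * <t, a>)], evaluated at random frequencies, as a training
# target for the state encoder. Shapes and wiring are assumptions.
import torch
import torch.nn as nn

def ecf(actions, freqs):
    """actions: (batch, seq*act_dim) flattened action sequences;
    freqs: (n_freq, seq*act_dim). Returns (n_freq, 2) = (Re, Im) of the ECF."""
    phase = actions @ freqs.t()                       # (batch, n_freq)
    return torch.stack([phase.cos().mean(0), phase.sin().mean(0)], dim=-1)

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # observation -> latent
head = nn.Linear(32, 20)                               # latent -> ECF (10 freqs x Re/Im)

obs = torch.randn(128, 64)
actions = torch.randn(128, 12)                         # e.g., 4 steps of 3-dim actions
freqs = torch.randn(10, 12)
target = ecf(actions, freqs).flatten()                 # (20,), parameter-free statistic
pred = head(encoder(obs)).mean(0)                      # batch-level prediction
loss = ((pred - target) ** 2).mean()                   # aligns encoder with action stats
loss.backward()
print(float(loss))
```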


Subject(s)
Reinforcement, Psychology; Humans; Neural Networks, Computer; Algorithms