Differentiable neural computer

Figure: A differentiable neural computer being trained to store and recall dense binary numbers. Performance on a reference task during training is shown. Upper left: the input (red) and target (blue), as 5-bit words and a 1-bit interrupt signal. Upper right: the model's output.

In artificial intelligence, a differentiable neural computer (DNC) is a memory-augmented neural network (MANN) architecture that is typically, but not by definition, recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.[1]

Applications

The DNC takes indirect inspiration from the Von Neumann architecture, which makes it likely to outperform conventional architectures on tasks that are fundamentally algorithmic and cannot be learned by finding a decision boundary.

So far, DNCs have been demonstrated to handle only relatively simple tasks that could also be solved using conventional programming. Unlike a conventional program, however, a DNC does not need to be programmed for each problem; it can instead be trained. Their attention span allows the user to feed in complex data structures such as graphs sequentially and recall them for later use. Furthermore, DNCs can learn aspects of symbolic reasoning and apply them to working memory. The researchers who published the method see promise that DNCs can be trained to perform complex, structured tasks[1][2] and address big-data applications that require some sort of reasoning, such as generating video commentaries or semantic text analysis.[3][4]

A DNC can be trained to navigate a rapid transit system and then apply what it has learned to a different system. A neural network without memory would typically have to learn about each transit system from scratch. On graph traversal and sequence-processing tasks with supervised learning, DNCs performed better than alternatives such as long short-term memory or a neural Turing machine.[5] With a reinforcement learning approach to a block puzzle problem inspired by SHRDLU, a DNC was trained via curriculum learning and learned to make a plan. It performed better than a traditional recurrent neural network.[5]

Architecture

DNC networks were introduced as an extension of the Neural Turing Machine (NTM), with the addition of memory attention mechanisms that control where memory is stored, and temporal attention that records the order of events. This structure allows DNCs to be more robust and abstract than an NTM, and still perform tasks that have longer-term dependencies than some predecessors such as long short-term memory (LSTM). The memory, which is simply a matrix, can be allocated dynamically and accessed indefinitely. The DNC is differentiable end-to-end (each subcomponent of the model is differentiable, therefore so is the whole model). This makes it possible to optimize it efficiently using gradient descent.[3][6][7]

The DNC model is similar to the Von Neumann architecture, and because of the resizability of memory, it is Turing complete.[8]

Traditional DNC


DNC, as originally published[1]

Independent variables
\mathbf{x}_t Input vector
\mathbf{z}_t Target vector
Controller
\boldsymbol\chi_t = [\mathbf{x}_t; \mathbf{r}_{t-1}^1; \cdots; \mathbf{r}_{t-1}^R] Controller input matrix


Deep (layered) LSTM \forall\;1\leq l\leq L
\mathbf{i}_t^l = \sigma(W_{i}^l [\boldsymbol\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_i^l) Input gate vector
\mathbf{o}_t^l = \sigma(W_{o}^l [\boldsymbol\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_o^l) Output gate vector
\mathbf{f}_t^l = \sigma(W_{f}^l [\boldsymbol\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_f^l) Forget gate vector
\mathbf{s}_t^l = \mathbf{f}_t^l \circ \mathbf{s}_{t-1}^l + \mathbf{i}_t^l \circ \tanh(W_{s}^l [\boldsymbol\chi_t; \mathbf{h}_{t-1}^l; \mathbf{h}_t^{l-1}] + \mathbf{b}_s^l) Cell state vector, \mathbf{s}_0^l = \mathbf{0}
\mathbf{h}_t^l = \mathbf{o}_t^l \circ \tanh(\mathbf{s}_t^l) Hidden state vector, \mathbf{h}_0^l = \mathbf{0}; \mathbf{h}_t^0 = \mathbf{0}\;\forall\;t


\mathbf{y}_t=W_y[\mathbf{h}_t^1;\cdots;\mathbf{h}_t^L]+W_r[\mathbf{r}_t^1;\cdots;\mathbf{r}_t^R] DNC output vector
Read & Write heads
\xi_t = W_\xi[\mathbf{h}_t^1;\cdots;\mathbf{h}_t^L] Interface parameters
\xi_t = [\mathbf{k}_t^{r,1};\cdots;\mathbf{k}_t^{r,R};\hat{\beta}_t^{r,1};\cdots;\hat{\beta}_t^{r,R};\mathbf{k}_t^w;\hat{\beta}_t^w;\hat{\mathbf{e}}_t;\mathbf{v}_t;\hat{f}_t^1;\cdots;\hat{f}_t^R;\hat{g}_t^a;\hat{g}_t^w;\hat{\boldsymbol\pi}_t^1;\cdots;\hat{\boldsymbol\pi}_t^R]


Read heads \forall\;1\leq i\leq R
\mathbf{k}_t^{r,i} Read keys
\beta_t^{r,i}=\text{oneplus}(\hat{\beta}_t^{r,i}) Read strengths
f_t^i=\sigma(\hat{f}_t^i) Free gates
\boldsymbol\pi_t^i=\text{softmax}(\hat{\boldsymbol\pi}_t^i) Read modes, \boldsymbol\pi_t^i\in\mathbb{R}^3


Write head
\mathbf{k}_t^w Write key
\beta_t^w=\text{oneplus}(\hat{\beta}_t^w) Write strength
\mathbf{e}_t=\sigma(\mathbf{\hat{e}}_t) Erase vector
\mathbf{v}_t Write vector
g_t^a=\sigma(\hat{g}_t^a) Allocation gate
g_t^w=\sigma(\hat{g}_t^w) Write gate
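The interface parameters are emitted as one flat vector, so an implementation has to slice it into the named pieces listed above before applying oneplus, sigmoid or softmax. The following is a minimal sketch of that slicing, assuming the ordering given in the interface parameter list, R read heads and memory words of width W. The function name `split_interface` and the dictionary keys are hypothetical and are not taken from the published model.

```python
def split_interface(xi, R, W):
    """Slice a flat interface vector into named, still unconstrained parts.

    Assumed layout (hypothetical helper, following the order listed above):
    R read keys of width W, R read strengths, one write key, one write
    strength, erase vector, write vector, R free gates, allocation gate,
    write gate, and R read modes of 3 values each.
    """
    expected = W * R + 3 * W + 5 * R + 3
    if len(xi) != expected:
        raise ValueError(f"expected {expected} values, got {len(xi)}")

    pos = 0

    def take(n):
        # Return the next n values and advance the cursor.
        nonlocal pos
        chunk = xi[pos:pos + n]
        pos += n
        return chunk

    # Dictionary entries are evaluated top to bottom, so the slices are
    # consumed in the same order as the layout described in the docstring.
    return {
        "read_keys": [take(W) for _ in range(R)],
        "read_strengths": take(R),
        "write_key": take(W),
        "write_strength": take(1)[0],
        "erase": take(W),
        "write_vector": take(W),
        "free_gates": take(R),
        "allocation_gate": take(1)[0],
        "write_gate": take(1)[0],
        "read_modes": [take(3) for _ in range(R)],
    }
```

With R = 2 and W = 3 the vector has 3·2 + 3·3 + 5·2 + 3 = 28 entries. The values returned here are the raw (hatted) quantities; the squashing functions still have to be applied to strengths, gates, erase vector and read modes.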
Memory
M_t=M_{t-1}\circ(E-\mathbf{w}_t^w\mathbf{e}_t^\intercal)+\mathbf{w}_t^w\mathbf{v}_t^\intercal Memory matrix, with E\in\mathbb{R}^{N\times W} a matrix of ones
\mathbf{u}_t=(\mathbf{u}_{t-1}+\mathbf{w}_{t-1}^w-\mathbf{u}_{t-1}\circ\mathbf{w}_{t-1}^w)\circ\boldsymbol\psi_t Usage vector
\mathbf{p}_t=\left(1-\sum_i\mathbf{w}_t^w[i]\right)\mathbf{p}_{t-1}+\mathbf{w}_t^w Precedence weighting, \mathbf{p}_0=\mathbf{0}
L_t[i,j]=(\mathbf{1}-\mathbf{I})[i,j]\left[(1-\mathbf{w}_t^w[i]-\mathbf{w}_t^w[j])L_{t-1}[i,j]+\mathbf{w}_t^w[i]\mathbf{p}_{t-1}[j]\right] Temporal link matrix, L_0=\mathbf{0}
\mathbf{w}_t^w=g_t^w[g_t^a\mathbf{a}_t+(1-g_t^a)\mathbf{c}_t^w] Write weighting
\mathbf{w}_t^{r,i}=\boldsymbol\pi_t^i[1]\mathbf{b}_t^i+\boldsymbol\pi_t^i[2]\mathbf{c}_t^{r,i}+\boldsymbol\pi_t^i[3]\mathbf{f}_t^i Read weighting
\mathbf{r}_t^i=M_t^\intercal\mathbf{w}_t^{r,i} Read vectors
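The memory, precedence and temporal link updates above can be illustrated with a small sketch. It uses plain Python lists for an N×W memory and takes the write and read weightings as given inputs. The function names are hypothetical, and the link update is written per entry with w_t^w[j] and p_{t-1}[j] for slot j and a zero diagonal; this is an illustration under those assumptions, not a reference implementation.

```python
def write_memory(memory, w_write, erase, value):
    """M_t[i][j] = M_{t-1}[i][j] * (1 - w[i] * e[j]) + w[i] * v[j]."""
    return [
        [m_ij * (1.0 - w_i * e_j) + w_i * v_j
         for m_ij, e_j, v_j in zip(row, erase, value)]
        for row, w_i in zip(memory, w_write)
    ]


def read_memory(memory, w_read):
    """r = M^T w: a weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(w_i * row[j] for row, w_i in zip(memory, w_read))
            for j in range(width)]


def update_links(link, precedence, w_write):
    """Return (L_t, p_t) computed from L_{t-1}, p_{t-1} and w_t^w."""
    n = len(w_write)
    new_link = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal stays zero
                new_link[i][j] = ((1.0 - w_write[i] - w_write[j]) * link[i][j]
                                  + w_write[i] * precedence[j])
    total = sum(w_write)
    new_precedence = [(1.0 - total) * p + w
                      for p, w in zip(precedence, w_write)]
    return new_link, new_precedence
```

Writing fully to slot 0 and then fully to slot 1 leaves a link entry L[1][0] = 1, which records that slot 1 was written directly after slot 0; the forward weighting can then move a read head from slot 0 to slot 1.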


\mathcal{C}(M,\mathbf{k},\beta)[i]=\frac{\exp\{\mathcal{D}(\mathbf{k},M[i,\cdot])\beta\}}{\sum_j\exp\{\mathcal{D}(\mathbf{k},M[j,\cdot])\beta\}} Content-based addressing, with lookup key \mathbf{k} and key strength \beta
\phi_t Indices of \mathbf{u}_t, sorted in ascending order of usage
\mathbf{a}_t[\phi_t[j]]=(1-\mathbf{u}_t[\phi_t[j]])\prod_{i=1}^{j-1}\mathbf{u}_t[\phi_t[i]] Allocation weighting
\mathbf{c}_t^w=\mathcal{C}(M_{t-1},\mathbf{k}_t^w,\beta_t^w) Write content weighting
\mathbf{c}_t^{r,i}=\mathcal{C}(M_t,\mathbf{k}_t^{r,i},\beta_t^{r,i}) Read content weighting
\mathbf{f}_t^i=L_t\mathbf{w}_{t-1}^{r,i} Forward weighting
\mathbf{b}_t^i=L_t^\intercal\mathbf{w}_{t-1}^{r,i} Backward weighting
\boldsymbol\psi_t=\prod_{i=1}^R\left(\mathbf{1}-f_t^i\mathbf{w}_{t-1}^{r,i}\right) Memory retention vector
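Content-based addressing and the allocation weighting are the two ways a write location is chosen, so a short sketch may help. The code below assumes a memory stored as a list of rows and a usage vector with entries in [0, 1]. The function names are hypothetical, and the small `eps` term in the cosine similarity is an added assumption to avoid division by zero for all-zero rows; it is not part of the definitions above.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """D(u, v) = u.v / (|u| |v|); eps is an assumed guard against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def content_weighting(memory, key, beta):
    """C(M, k, beta): softmax over memory rows of beta * cosine similarity."""
    scores = [beta * cosine_similarity(key, row) for row in memory]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocation_weighting(usage):
    """a[phi[j]] = (1 - u[phi[j]]) * prod_{i<j} u[phi[i]], phi sorted by usage."""
    phi = sorted(range(len(usage)), key=lambda i: usage[i])  # least used first
    alloc = [0.0] * len(usage)
    running_product = 1.0
    for idx in phi:
        alloc[idx] = (1.0 - usage[idx]) * running_product
        running_product *= usage[idx]
    return alloc
```

For a usage vector of [1.0, 0.0, 0.5], all allocation weight goes to the unused slot 1. A larger key strength beta makes the content weighting more sharply peaked on the best-matching row.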
Definitions
\mathbf{W},\mathbf{b} Weight matrix, bias vector
\mathbf{0},\mathbf{1},\mathbf{I} Zeros matrix, ones matrix, identity matrix
\circ Element-wise multiplication
\mathcal{D}(\mathbf{u},\mathbf{v})=\frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|} Cosine similarity
\sigma(x)=1/(1+e^{-x}) Sigmoid function
\text{oneplus}(x)=1+\log(1+e^x) Oneplus function
\text{softmax}(\mathbf{x})_j = \frac{e^{x_j}}{\sum_{k=1}^K e^{x_k}} for j = 1, \ldots, K Softmax function
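The sigmoid, oneplus and softmax functions defined above overflow if evaluated literally for large inputs. The sketch below shows one common way to compute them stably using only the Python standard library; the rearrangements are standard identities and are an implementation choice, not something specified by the definitions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe because x < 0
    return z / (1.0 + z)


def oneplus(x: float) -> float:
    """oneplus(x) = 1 + log(1 + e^x), so the result is never below 1."""
    # log(1 + e^x) = max(x, 0) + log(1 + e^-|x|), which never overflows.
    return 1.0 + max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs: list[float]) -> list[float]:
    """Softmax with the maximum subtracted before exponentiation."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

In the model these map unconstrained controller outputs to their valid ranges: oneplus gives strengths in [1, ∞), sigmoid gives gates in [0, 1], and softmax gives read modes that sum to one.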

Extensions

Refinements include sparse memory addressing, which reduces time and space complexity by a factor of thousands. This can be achieved by using an approximate nearest-neighbor algorithm, such as locality-sensitive hashing, or a random k-d tree, as in the Fast Library for Approximate Nearest Neighbors from UBC.[9] Adding Adaptive Computation Time (ACT) separates computation time from data time, exploiting the fact that problem length and problem difficulty are not always the same.[10] Training using synthetic gradients performs considerably better than backpropagation through time (BPTT).[11] Robustness can be improved with the use of layer normalization and Bypass Dropout as regularization.[12]
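To give a rough idea of what sparse addressing means, the sketch below restricts a content-based read to the k most similar memory slots and normalizes over those only. It is an illustration under stated assumptions: an exact top-k selection stands in for the approximate nearest-neighbor index mentioned above, the similarities are assumed to be precomputed, and the function name `sparse_read_weighting` is hypothetical.

```python
import heapq
import math


def sparse_read_weighting(similarities, beta, k):
    """Softmax over only the k most similar slots; all others get weight 0.

    Returns a dict {slot index: weight}. Exact top-k selection is used here
    as a stand-in for an approximate nearest-neighbor lookup.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top = heapq.nlargest(k, range(len(similarities)),
                         key=lambda i: similarities[i])
    best = max(similarities[i] for i in top)
    # Subtract the best score before exponentiating for numerical stability.
    exps = {i: math.exp(beta * (similarities[i] - best)) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}
```

Because only k entries are stored and updated per read, the cost of this step no longer grows with the number of memory slots once the candidate slots are known; finding those candidates quickly is the job of the nearest-neighbor index.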
