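Dive into Deep Learning 2.3 Linear Algebra: Notes & Exercises (PyTorch)

感情走私 · Posted on 2023-8-19 09:51:17

These are study notes that supplement Li Mu's course and textbook, together with some thoughts on the end-of-section exercises. I keep them for my own review and share them for discussion with fellow learners.

Course video for this section: 线性代数_哔哩哔哩_bilibili
Textbook for this section: 2.3. 线性代数 - 动手学深度学习 2.0.0 documentation (d2l.ai)
Open-source code for this section: ...>d2l-zh>pytorch>chapter_preliminaries>linear-algebra
<hr/>
Linear Algebra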

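Having covered how to store and manipulate data, we now briefly review some basic linear algebra. This material helps readers understand and implement most of the models introduced in the book. This section introduces the basic mathematical objects, arithmetic, and operations of linear algebra, and expresses each of them in mathematical notation and in the corresponding code.
Scalars

If you have ever paid a restaurant bill, you already know some basic linear algebra, such as adding or multiplying numbers. For example, the temperature in Beijing is 52℉ (degrees Fahrenheit, a unit of temperature other than Celsius). Strictly speaking, a value consisting of a single number is called a scalar. To convert this Fahrenheit value to the more commonly used Celsius, evaluate the expression c = \frac{5}{9} (f - 32) with f set to 52. In this equation, each term (5, 9, and 32) is a scalar value. The symbols c and f are called variables; they represent unknown scalar values.
This book follows standard mathematical notation, in which scalar variables are denoted by ordinary lowercase letters (e.g., x, y, and z). The book uses \mathbb{R} to denote the space of all (continuous) real-valued scalars. What a space is will be defined rigorously later; for now, just remember that the expression x \in \mathbb{R} is the formal way of saying that x is a real-valued scalar. The symbol \in is read "in" and denotes membership in a set. For example, x, y \in \{0,1\} states that x and y are numbers whose values can only be 0 or 1.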
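As a small illustration of the Fahrenheit-to-Celsius conversion described above, here is a minimal sketch. The variable names `f` and `c` simply mirror the symbols in the text; nothing else is assumed.

```python
import torch

# Fahrenheit temperature from the text, stored as a one-element tensor (a scalar).
f = torch.tensor(52.0)

# c = 5/9 * (f - 32); every constant in the expression is itself a scalar.
c = 5 / 9 * (f - 32)

print(c)         # a tensor with no axes
print(c.shape)   # torch.Size([])
print(c.item())  # the plain Python float, roughly 11.11
```

Calling `.item()` is only needed when a plain Python number is wanted; arithmetic works directly on the tensor.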
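A scalar is represented by a tensor with just one element. The code below instantiates two scalars and performs some familiar arithmetic operations on them: addition, multiplication, division, and exponentiation.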
import torch

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y
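(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

Vectors

A vector can be viewed as a list of scalar values. These scalar values are called the elements (or components) of the vector. When vectors represent samples from a dataset, their values carry real-world meaning. For example, if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose components correspond to their income, length of employment, number of previous defaults, and other factors. If we were studying the risk of heart attack that hospital patients may face, we might represent each patient by a vector whose components capture their most recent vital signs, cholesterol levels, minutes of exercise per day, and so on. In mathematical notation, vectors are usually denoted by bold, lowercase symbols (e.g., \mathbf{x}, \mathbf{y}, and \mathbf{z}).
We represent vectors with one-dimensional tensors. In general, tensors can have arbitrary lengths, subject to the memory limits of the machine.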
x = torch.arange(4)
x
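tensor([0, 1, 2, 3])
We can refer to any element of a vector with a subscript; for example, x_{i} refers to the i-th element. Note that the element x_{i} is a scalar, so we do not set it in bold when referring to it. Much of the literature treats column vectors as the default orientation of vectors, and so does this book. In mathematics, a vector \mathbf{x} can be written as:
\mathbf{x} =\begin{bmatrix}x_{1}  \\x_{2}  \\ \vdots  \\x_{n}\end{bmatrix},
where x_{1}, \ldots, x_{n} are the elements of the vector. In code, we access any element by indexing into the tensor. Indexing in code is zero-based, so x[3] is the fourth element.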
x[3]
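tensor(3)

Length, Dimensionality, and Shape

A vector is just an array of numbers, and just as every array has a length, so does every vector. In mathematical notation, to say that a vector \mathbf{x} consists of n real-valued scalars, we write \mathbf{x} \in \mathbb{R}^{n}. The length of a vector is commonly called its dimension.
As with an ordinary Python array, we can access the length of a tensor by calling Python's built-in len() function.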
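len(x) # returns a scalar (a plain Python int)
4
When a tensor represents a vector (exactly one axis), we can also access its length through the .shape attribute. The shape is a tuple that lists the length (dimensionality) of the tensor along each axis. For a tensor with only one axis, the shape has just one element.
x.shape # returns a torch.Size (a tuple subclass) with a single element
torch.Size([4])
Note that the word "dimension" tends to mean different things in different contexts, which often confuses people. To be clear: the dimension of a vector or of an axis refers to its length, that is, the number of elements along that vector or axis. The dimension of a tensor, however, refers to the number of axes the tensor has. In this sense, the dimensionality of some axis of a tensor is the length of that axis.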
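The two meanings of "dimension" can be told apart in code. The sketch below uses only attributes that exist on every tensor (`shape`, `ndim`); the tensor `X` is an illustrative example, not one defined at this point in the text.

```python
import torch

x = torch.arange(4)                    # a vector: one axis of length 4
X = torch.arange(24).reshape(2, 3, 4)  # a tensor with three axes

# "Dimension" of a vector or of an axis = its length.
print(len(x), x.shape[0])   # 4 4

# "Dimension" of a tensor = its number of axes.
print(x.ndim, X.ndim)       # 1 3

# The length of each individual axis of X.
print(X.shape)              # torch.Size([2, 3, 4])
```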
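Matrices

Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order two. Matrices, which we usually denote with bold capital letters (e.g., \mathbf{X}, \mathbf{Y}, and \mathbf{Z}), are represented in code as tensors with two axes.
In mathematical notation, \mathbf{A} \in \mathbb{R}^{m\times n} denotes a matrix \mathbf{A} consisting of m rows and n columns of real-valued scalars. We can picture any matrix \mathbf{A} \in \mathbb{R}^{m\times n} as a table in which each element a_{ij} belongs to the i-th row and j-th column:
\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.
For any \mathbf{A} \in \mathbb{R}^{m\times n}, the shape of \mathbf{A} is (m, n) or m \times n. When a matrix has the same number of rows and columns, its shape is a square; it is therefore called a square matrix.
When calling functions to instantiate a tensor, we can create a matrix of shape m \times n by specifying the two components m and n.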
A = torch.arange(20).reshape(5, 4)
A
tensor([[ 0,  1,  2,  3],
      [ 4,  5,  6,  7],
      [ 8,  9, 10, 11],
      [12, 13, 14, 15],
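      [16, 17, 18, 19]])
We can access the scalar element a_{ij} of a matrix through its row index (i) and column index (j), for example [\mathbf{A}]_{ij}. When the scalar elements of a matrix \mathbf{A} are not given explicitly, as in the definition above, we can simply use the lowercase letter of the matrix with index subscripts, a_{ij}, to refer to [\mathbf{A}]_{ij}. To keep notation simple, commas are inserted to separate indices only when necessary, such as a_{2, 3j} and [\mathbf{A}]_{2i-1, 3}.
When we exchange the rows and columns of a matrix, the result is called the transpose of the matrix. We usually write \mathbf{A}^\top for the transpose of \mathbf{A}; if \mathbf{B}=\mathbf{A}^\top, then b_{ij}=a_{ji} for all i and j. The transpose of the matrix \mathbf{A} defined above is therefore a matrix of shape n \times m:
\mathbf{A}^\top = \begin{bmatrix}    a_{11} & a_{21} & \dots  & a_{m1} \\    a_{12} & a_{22} & \dots  & a_{m2} \\    \vdots & \vdots & \ddots  & \vdots \\    a_{1n} & a_{2n} & \dots  & a_{mn} \end{bmatrix}.
Now we access the transpose of a matrix in code.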
A.T
tensor([[ 0,  4,  8, 12, 16],
      [ 1,  5,  9, 13, 17],
      [ 2,  6, 10, 14, 18],
      [ 3,  7, 11, 15, 19]])
As a special type of square matrix, a symmetric matrix \mathbf{A} equals its transpose: \mathbf{A} = \mathbf{A}^\top. Here we define a symmetric matrix \mathbf{B}:
B = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B
tensor([[1, 2, 3],
      [2, 0, 4],
      [3, 4, 5]])
Now we compare B with its transpose.
B == B.T
tensor([[True, True, True],
      [True, True, True],
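      [True, True, True]])
Matrices are useful data structures: they allow us to organize data that has different modalities of variation. For example, rows in our matrix might correspond to different houses (data samples), while columns might correspond to different attributes. This should sound familiar to anyone who has used spreadsheet software or read the earlier pandas data preprocessing section. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset it is more conventional to treat each data sample as a row vector in the matrix. As later chapters will show, this convention supports common deep learning practices. For example, along the outermost axis of a tensor, we can access or enumerate minibatches of data samples.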
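To make the "rows are samples" convention concrete, here is a minimal sketch. The matrix below is treated as a hypothetical dataset of 5 samples with 4 attributes each; that interpretation is an assumption for illustration only.

```python
import torch

# A hypothetical data matrix: 5 samples (rows) with 4 attributes (columns).
A = torch.arange(20).reshape(5, 4)

# Entry-wise definition of the transpose: [A^T]_{ji} = a_{ij}.
i, j = 1, 3
assert A.T[j, i] == A[i, j]

# Iterating over a tensor walks along its outermost axis (axis 0),
# so each step yields one row, i.e. one data sample.
for sample in A:
    print(sample)

# Slicing along axis 0 selects a minibatch of samples.
minibatch = A[0:2]
print(minibatch.shape)  # torch.Size([2, 4])
```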
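Tensors

Just as vectors generalize scalars and matrices generalize vectors, we can build data structures with even more axes. Tensors ("tensor" in this subsection refers to the algebraic object) give us a generic way of describing n-dimensional arrays with an arbitrary number of axes. For example, vectors are first-order tensors and matrices are second-order tensors. Tensors are denoted with capital letters in a special font (e.g., \mathsf{X}, \mathsf{Y}, and \mathsf{Z}), and their indexing mechanism (e.g., x_{ijk} and [\mathsf{X}]_{1,2i-1,3}) is similar to that of matrices.
Tensors become more important when we start working with images, which arrive as n-dimensional arrays with 3 axes corresponding to the height, the width, and a channel axis for the color channels (red, green, and blue). For now, we set higher-order tensors aside and focus on the basics.
X = torch.arange(24).reshape(2, 3, 4)
# A 3-axis tensor: two 3×4 matrices; for an image the axes would be channel, height, and width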
X
tensor([[[ 0,  1,  2,  3],
      [ 4,  5,  6,  7],
      [ 8,  9, 10, 11]],

      [[12, 13, 14, 15],
      [16, 17, 18, 19],
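      [20, 21, 22, 23]]])

Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and tensors with an arbitrary number of axes ("tensor" in this subsection refers to the algebraic object) have some useful properties. For example, from the definition of an elementwise operation you may have noticed that any elementwise unary operation does not change the shape of its operand. Similarly, given any two tensors of the same shape, the result of any elementwise binary operation is a tensor of that same shape. For example, adding two matrices of the same shape performs elementwise addition over the two matrices.
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
B = A.clone()  # allocate new memory and assign a copy of A to B; B is independent of A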
A, A + B
(tensor([[ 0.,  1.,  2.,  3.],
      [ 4.,  5.,  6.,  7.],
      [ 8.,  9., 10., 11.],
      [12., 13., 14., 15.],
      [16., 17., 18., 19.]]),
tensor([[ 0.,  2.,  4.,  6.],
      [ 8., 10., 12., 14.],
      [16., 18., 20., 22.],
      [24., 26., 28., 30.],
      [32., 34., 36., 38.]]))
Specifically, the elementwise multiplication of two matrices is called their Hadamard product (mathematical notation \odot). Consider a matrix \mathbf{B} \in \mathbb{R}^{m \times n} whose element in row i and column j is b_{ij}. The Hadamard product of the matrix \mathbf{A} (defined above) and \mathbf{B} is:
\mathbf{A} \odot \mathbf{B} = \begin{bmatrix}    a_{11}  b_{11} & a_{12}  b_{12} & \dots  & a_{1n}  b_{1n} \\    a_{21}  b_{21} & a_{22}  b_{22} & \dots  & a_{2n}  b_{2n} \\    \vdots & \vdots & \ddots & \vdots \\    a_{m1}  b_{m1} & a_{m2}  b_{m2} & \dots  & a_{mn}  b_{mn} \end{bmatrix}.
A * B
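tensor([[  0.,   1.,   4.,   9.],
      [ 16.,  25.,  36.,  49.],
      [ 64.,  81., 100., 121.],
      [144., 169., 196., 225.],
      [256., 289., 324., 361.]])
Multiplying a tensor by a scalar or adding a scalar to it does not change the shape of the tensor either; each element of the tensor is added to or multiplied by the scalar.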
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
(tensor([[[ 2,  3,  4,  5],
      [ 6,  7,  8,  9],
      [10, 11, 12, 13]],

      [[14, 15, 16, 17],
      [18, 19, 20, 21],
      [22, 23, 24, 25]]]),
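torch.Size([2, 3, 4]))

Reduction

One useful operation that we can perform on any tensor is computing the sum of its elements. Mathematical notation uses the \sum symbol for summation. The sum of the elements of a vector \mathbf{x} of length d is written \sum_{i=1}^d x_i. In code, we can call the function that computes the sum: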
x = torch.arange(4, dtype=torch.float32)
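x, x.sum() # returns a scalar
(tensor([0., 1., 2., 3.]), tensor(6.))
We can express the sum of the elements of a tensor of any shape. For example, the sum of the elements of the matrix \mathbf{A} can be written \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}.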
A.shape, A.sum()
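(torch.Size([5, 4]), tensor(190.))
By default, calling the sum function reduces a tensor along all of its axes to a scalar. We can also specify the axis along which the tensor is reduced by summation. Taking a matrix as an example, to reduce it by summing the elements over all rows (axis 0), specify axis=0 when calling the function. Because the input matrix is reduced along axis 0 to produce the output vector, the dimension of axis 0 of the input is lost in the output shape.
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
# The dtype must be floating point here so that A.mean() can be computed later
A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape
# Summing over axis 0 collapses that axis, so it disappears from the output shape (likewise below)
(tensor([40., 45., 50., 55.]), torch.Size([4]))
Specifying axis=1 reduces the column dimension (axis 1) by summing up the elements of all columns. Thus the dimension of axis 1 of the input is lost in the output shape.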
A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape
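(tensor([ 6., 22., 38., 54., 70.]), torch.Size([5]))
Reducing a matrix along both rows and columns via summation is equivalent to summing up all elements of the matrix.
A.sum(axis=[0, 1]).shape # all axes are collapsed; the result is the same as A.sum()
# In other words, when no axis is specified, sum reduces over every axis of the tensor at once
torch.Size([])
<hr/>Note:
axis = i means operating along the direction in which the i-th index varies.
<hr/>
A quantity related to the sum is the mean (also called the average). We calculate the mean by dividing the sum by the total number of elements. In code, we can call the function that computes the mean of a tensor of any shape.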
A.mean(), A.sum() / A.numel()
(tensor(9.5000), tensor(9.5000))
Likewise, the function for calculating the mean can also reduce a tensor along specified axes.
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
(tensor([ 8.,  9., 10., 11.]), tensor([ 8.,  9., 10., 11.]))
# Supplement
a = torch.ones((2,5,4))
a.shape
torch.Size([2, 5, 4])
a.sum(axis=[0,2]).shape
torch.Size([5])
a.sum(axis=1, keepdims=True).shape
torch.Size([2, 1, 4])
a.sum(axis=[0,2], keepdims=True).shape
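torch.Size([1, 5, 1])

Non-Reduction Sum

However, it is sometimes useful to keep the number of axes unchanged when calling the function for the sum or the mean.
sum_A = A.sum(axis=1, keepdims=True)
sum_A
# With axis=1, keepdims=True keeps the number of axes of the result equal to that of A, so the result can be broadcast against A row by row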
tensor([[ 6.],
      [22.],
      [38.],
      [54.],
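      [70.]])
<hr/>Note:
Requirements for broadcasting:
1. Each tensor has at least one dimension.
2. When iterating over the dimension sizes, aligned from the trailing (last) dimension, the two sizes must either be equal,
(for example:
x = torch.ones(2,3,4)
y = torch.ones(2,3,4))
or one of the two sizes is 1,
(for example:
a = torch.ones(2,3,4)
b = torch.ones(2,3,1))
or the dimension does not exist in one of the tensors.
(for example:
m = torch.ones(2,3,4)
n = torch.ones(  3,1))
Broadcasting takes place along the missing dimensions and (or) the dimensions of length 1.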
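The rules in this note can be checked without allocating any data. The sketch below uses `torch.broadcast_shapes`, which returns the broadcast result shape for a set of shapes; the three shape pairs are the ones listed in the note, and the failing pair is an extra illustration.

```python
import torch

# The three example pairs from the note above.
pairs = [
    ((2, 3, 4), (2, 3, 4)),  # all sizes equal
    ((2, 3, 4), (2, 3, 1)),  # trailing size 1 is stretched to 4
    ((2, 3, 4), (3, 1)),     # missing leading axis is added, size 1 is stretched
]
for s1, s2 in pairs:
    # torch.broadcast_shapes applies the broadcasting rules to shapes only.
    print(s1, s2, "->", tuple(torch.broadcast_shapes(s1, s2)))

# A pair that violates the rules: trailing sizes 4 and 5 differ and neither is 1.
try:
    torch.broadcast_shapes((5, 4), (5,))
except RuntimeError as err:
    print("not broadcastable:", err)
```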
<hr/>
# Special example:
i = torch.ones(5,1)
j = torch.ones(1,5)
i+j
# shape of i: (5, 1)
# shape of j: (1, 5); both are broadcast to (5, 5)
tensor([[2., 2., 2., 2., 2.],
      [2., 2., 2., 2., 2.],
      [2., 2., 2., 2., 2.],
      [2., 2., 2., 2., 2.],
      [2., 2., 2., 2., 2.]])
For example, since sum_A still keeps its two axes after summing each row, we can divide A by sum_A with broadcasting.
A / sum_A
tensor([[0.0000, 0.1667, 0.3333, 0.5000],
      [0.1818, 0.2273, 0.2727, 0.3182],
      [0.2105, 0.2368, 0.2632, 0.2895],
      [0.2222, 0.2407, 0.2593, 0.2778],
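      [0.2286, 0.2429, 0.2571, 0.2714]])
If we want to calculate the cumulative sum of the elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. This function does not reduce the input tensor along any axis.
<hr/>Note:
Describing axis=0 as "row by row" can be misleading. cumsum actually accumulates along the direction in which index 0 varies, that is, down each column: row i of the result holds the sum of rows 0 through i of A, so the last row equals A.sum(axis=0).
<hr/>
A.cumsum(axis=0)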
tensor([[ 0.,  1.,  2.,  3.],
      [ 4.,  6.,  8., 10.],
      [12., 15., 18., 21.],
      [24., 28., 32., 36.],
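      [40., 45., 50., 55.]])

Dot Product

So far we have covered elementwise operations, sums, and averages. Another of the most fundamental operations is the dot product. Given two vectors \mathbf{x},\mathbf{y}\in\mathbb{R}^d, their dot product \mathbf{x}^\top\mathbf{y} (or \langle\mathbf{x},\mathbf{y}\rangle) is the sum of the products of the elements at the same position: \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i.
y = torch.ones(4, dtype = torch.float32)
x, y, torch.dot(x, y) # the dot product returns a scalar
(tensor([0., 1., 2., 3.]), tensor([1., 1., 1., 1.]), tensor(6.))
Note that we can express the dot product of two vectors equivalently by performing an elementwise multiplication and then a sum:
torch.sum(x * y)
tensor(6.)
Dot products are useful in a wide range of contexts. For example, given a set of values denoted by a vector \mathbf{x} \in \mathbb{R}^d and a set of weights denoted by \mathbf{w} \in \mathbb{R}^d, the weighted sum of the values in \mathbf{x} according to the weights \mathbf{w} can be expressed as the dot product \mathbf{x}^\top \mathbf{w}. When the weights are non-negative and sum to one (i.e., \sum_{i=1}^{d}{w_i}=1), the dot product expresses a weighted average. After normalizing two vectors to have unit length, the dot product expresses the cosine of the angle between them. The notion of length is introduced formally later in this section.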
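The two uses of the dot product mentioned above can be sketched as follows. The weights `w` and the vectors `u`, `v` are made-up values chosen for illustration.

```python
import torch

x = torch.tensor([0.0, 1.0, 2.0, 3.0])   # values
w = torch.tensor([0.1, 0.2, 0.3, 0.4])   # illustrative weights: non-negative, summing to 1

# Weighted average as a dot product: 0*0.1 + 1*0.2 + 2*0.3 + 3*0.4
weighted_avg = torch.dot(x, w)
print(weighted_avg)

# Cosine of the angle between two vectors:
# normalize each to unit length, then take the dot product.
u = torch.tensor([1.0, 0.0])
v = torch.tensor([1.0, 1.0])
cos_uv = torch.dot(u / torch.linalg.norm(u), v / torch.linalg.norm(v))
print(cos_uv)  # cos(45 degrees), about 0.7071
```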
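Matrix-Vector Products

Now that we know how to calculate dot products, we can begin to understand the matrix-vector product. Recall the matrix \mathbf{A} \in \mathbb{R}^{m \times n} and the vector \mathbf{x} \in \mathbb{R}^n defined earlier. Let us write the matrix \mathbf{A} in terms of its row vectors:
\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix},
where each \mathbf{a}^\top_{i} \in \mathbb{R}^n is a row vector representing the i-th row of the matrix. The matrix-vector product \mathbf{A}\mathbf{x} is a column vector of length m whose i-th element is the dot product \mathbf{a}^\top_i \mathbf{x}:
\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix}  \mathbf{a}^\top_{1} \mathbf{x}  \\  \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots\\  \mathbf{a}^\top_{m} \mathbf{x}\\ \end{bmatrix}.
We can think of multiplication by a matrix \mathbf{A} \in \mathbb{R}^{m \times n} as a transformation that maps vectors from \mathbb{R}^{n} to \mathbb{R}^{m}. These transformations are remarkably useful. For example, rotations can be represented as multiplications by a square matrix. As later chapters will show, matrix-vector products also describe the intensive calculations required to compute each layer of a neural network given the values of the previous layer.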
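As an illustration of a matrix acting as a transformation, the sketch below builds the standard 2-D rotation matrix (a well-known formula, not taken from the text above) and applies it with torch.mv. It then checks the row-wise dot-product view of the matrix-vector product on a 5 × 4 example.

```python
import math
import torch

theta = math.pi / 2  # rotate by 90 degrees counterclockwise
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

v = torch.tensor([1.0, 0.0])
rotated = torch.mv(R, v)
print(rotated)  # approximately tensor([0., 1.])

# A non-square matrix maps R^n to R^m: here n = 4 and m = 5.
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
x = torch.arange(4, dtype=torch.float32)
print(torch.mv(A, x).shape)  # torch.Size([5])

# Each output element is the dot product of one row of A with x.
row_dots = torch.stack([torch.dot(A[i], x) for i in range(A.shape[0])])
print(torch.allclose(row_dots, torch.mv(A, x)))
```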
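<hr/>Note:
At first I did not understand why the row vectors of the matrix \mathbf{A} are written \mathbf{a}^\top_{i} rather than \mathbf{a}_{i}, since \mathbf{A} simply has m row vectors \mathbf{a}_{1}, \cdots, \mathbf{a}_{m}.
After looking into it: the textbook treats vectors as column vectors by default (as stated in the Vectors subsection). A PyTorch one-dimensional tensor has a single axis and carries no row or column orientation, which is why a one-dimensional tensor and its transpose print identically, as a single row.
So \mathbf{a}_{i} denotes a column vector by default, and to write it as the i-th row of the matrix \mathbf{A} we use its transpose \mathbf{a}^\top_{i}.
<hr/>
To express a matrix-vector product in code with tensors, we use the mv function. Calling torch.mv(A, x) with a matrix A and a vector x performs the matrix-vector product. Note that the column dimension of A (its length along axis 1) must be the same as the dimension of x (its length).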
A.shape, x.shape, torch.mv(A, x)
(torch.Size([5, 4]), torch.Size([4]), tensor([ 14.,  38.,  62.,  86., 110.]))

Matrix-Matrix Multiplication

Once you are comfortable with dot products and matrix-vector products, matrix-matrix multiplication should be straightforward.
Suppose we have two matrices \mathbf{A} \in \mathbb{R}^{n \times k} and \mathbf{B} \in \mathbb{R}^{k \times m}:
\mathbf{A}=\begin{bmatrix}  a_{11} & a_{12} & \cdots & a_{1k} \\  a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\  a_{n1} & a_{n2} & \cdots & a_{nk} \\ \end{bmatrix},\quad \mathbf{B}=\begin{bmatrix}  b_{11} & b_{12} & \cdots & b_{1m} \\  b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\  b_{k1} & b_{k2} & \cdots & b_{km} \\ \end{bmatrix}.
Let the row vector \mathbf{a}^\top_{i} \in \mathbb{R}^k denote the i-th row of the matrix \mathbf{A}, and let the column vector \mathbf{b}_{j} \in \mathbb{R}^k denote the j-th column of the matrix \mathbf{B}. To form the matrix product \mathbf{C} = \mathbf{A}\mathbf{B}, it is easiest to think of \mathbf{A} in terms of its row vectors and \mathbf{B} in terms of its column vectors:
\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix}  \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix}.
We then simply compute each element c_{ij} as the dot product \mathbf{a}^\top_i \mathbf{b}_j:
\mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix}  \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\  \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\  \vdots & \vdots & \ddots &\vdots\\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}.
We can think of the matrix-matrix multiplication \mathbf{AB} as simply performing m matrix-vector products and stitching the results together to form an n \times m matrix. In the code below, we perform matrix multiplication on A and B. Here, A is a matrix with 5 rows and 4 columns, and B is a matrix with 4 rows and 3 columns. Their product is a matrix with 5 rows and 3 columns.
B = torch.ones(4, 3)
torch.mm(A, B)
tensor([[ 6.,  6.,  6.],
      [22., 22., 22.],
      [38., 38., 38.],
      [54., 54., 54.],
      [70., 70., 70.]])
Matrix-matrix multiplication can simply be called matrix multiplication, and should not be confused with the Hadamard product.
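The view of matrix multiplication as m matrix-vector products stitched together can be sketched directly. The helper names `columns`, `C`, and `H` are illustrative; A and B are rebuilt here so the snippet runs on its own.

```python
import torch

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)  # n x k = 5 x 4
B = torch.ones(4, 3)                                     # k x m = 4 x 3

# One matrix-vector product per column of B, stitched together as the columns of C.
columns = [torch.mv(A, B[:, j]) for j in range(B.shape[1])]
C = torch.stack(columns, dim=1)

print(C.shape)                            # torch.Size([5, 3])
print(torch.allclose(C, torch.mm(A, B)))  # agrees with torch.mm

# The Hadamard product is a different operation: it needs equal shapes
# and multiplies element by element.
H = A * A
print(H.shape)                            # torch.Size([5, 4])
```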
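Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big the vector is. The notion of size considered here concerns not dimensionality but the magnitude of the components.
In linear algebra, a vector norm is a function f that maps a vector to a scalar. Given any vector \mathbf{x}, a vector norm must satisfy a handful of properties. The first property: if we scale all the elements of a vector by a constant factor \alpha, its norm also scales by the absolute value of the same constant factor:
f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}).
The second property is the familiar triangle inequality:
f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}).
The third property simply says that the norm must be non-negative:
f(\mathbf{x}) \geq 0.
That makes sense, since in most contexts the smallest size for anything is 0. The final property requires that the smallest norm, 0, is attained if and only if the vector consists of all zeros:
\forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x})=0.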
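The four properties can be spot-checked numerically for the L_2 norm. This is only a sanity check on made-up vectors, not a proof; `torch.linalg.norm` with default arguments on a one-dimensional tensor computes the L_2 norm.

```python
import torch

f = torch.linalg.norm  # for 1-D input with default arguments this is the L2 norm

x = torch.tensor([3.0, -4.0])
y = torch.tensor([1.0, 2.0])
alpha = -2.0

# 1. Absolute homogeneity: f(alpha * x) = |alpha| * f(x)
print(f(alpha * x), abs(alpha) * f(x))

# 2. Triangle inequality: f(x + y) <= f(x) + f(y)
print(f(x + y) <= f(x) + f(y))

# 3. Non-negativity: f(x) >= 0
print(f(x) >= 0)

# 4. The norm of the all-zero vector is zero
print(f(torch.zeros(2)))
```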
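Norms sound a lot like measures of distance. The notions of non-negativity and the triangle inequality from Euclidean distance and the Pythagorean theorem may ring a bell. In fact, the Euclidean distance is an L_2 norm: suppose that the elements of the n-dimensional vector \mathbf{x} are x_1,\ldots,x_n; its L_2 norm is the square root of the sum of the squares of the vector elements:
\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},
where the subscript 2 is often omitted for the L_2 norm, that is, \|\mathbf{x}\| is equivalent to \|\mathbf{x}\|_2. In code, we can calculate the L_2 norm of a vector as follows.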
u = torch.tensor([3.0, -4.0])
torch.norm(u)
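tensor(5.)
In deep learning, we work more often with the squared L_2 norm. We also frequently encounter the L_1 norm, which is the sum of the absolute values of the vector elements:
\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.
Compared with the L_2 norm, the L_1 norm is less influenced by outliers. To calculate the L_1 norm, we compose the absolute value function with a sum over the elements.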
torch.abs(u).sum()
tensor(7.)
Both the L_2 norm and the L_1 norm are special cases of the more general L_p norm:
\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.
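The L_p formula can be written out directly and compared with the library routine. The function `lp_norm` below is a hypothetical helper written for this note, assuming p >= 1 and a one-dimensional input.

```python
import torch

def lp_norm(x, p):
    """L_p norm of a 1-D tensor, written directly from the formula (assumes p >= 1)."""
    return x.abs().pow(p).sum().pow(1.0 / p)

u = torch.tensor([3.0, -4.0])

print(lp_norm(u, 1))  # L1: |3| + |-4| = 7
print(lp_norm(u, 2))  # L2: sqrt(9 + 16) = 5

# Cross-check against torch.linalg.norm for a few values of p.
for p in (1, 2, 3):
    print(p, torch.isclose(lp_norm(u, p), torch.linalg.norm(u, ord=p)))
```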
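Analogous to the L_2 norm of vectors, the Frobenius norm of a matrix \mathbf{X} \in \mathbb{R}^{m \times n} is the square root of the sum of the squares of the matrix elements:
\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.
The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an L_2 norm of a matrix-shaped vector. Calling the following function computes the Frobenius norm of a matrix.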
torch.norm(torch.ones((4, 9)))
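tensor(6.)

Norms and Objectives

In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; minimize the distance between predictions and the ground-truth observations; assign vector representations to items (such as words, products, or news articles) so that the distance between similar items is minimized and the distance between dissimilar items is maximized. Objectives, perhaps the most important component of deep learning algorithms (besides the data), are usually expressed as norms.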
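As a small illustration of "distance between predictions and observations" expressed as a norm, consider the sketch below. The numbers in `y_hat` and `y` are made up for illustration; real objectives in later chapters are defined there, not here.

```python
import torch

# Made-up predictions and observations, for illustration only.
y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])
y = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Distance between prediction and observation as a squared L2 norm.
squared_l2 = torch.linalg.norm(y_hat - y) ** 2
print(squared_l2)

# The same quantity written out elementwise.
print(((y_hat - y) ** 2).sum())

# An L1-norm objective on the same data.
l1 = (y_hat - y).abs().sum()
print(l1)
```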
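More on Linear Algebra

In just this section, we have covered all the linear algebra needed to follow this book and understand modern deep learning. There is a lot more to linear algebra, and much of that mathematics is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. Entire subfields of machine learning focus on using matrix decompositions and their generalizations to higher-order tensors to discover structure in datasets and solve prediction problems. Once you start getting your hands dirty and applying effective machine learning models to real datasets, you will be more inclined to learn more mathematics. So this section ends here, and the book introduces more mathematics later on.
If you are eager to learn more about linear algebra, you may refer to the online appendix on linear algebraic operations or other excellent resources :cite:Strang.1993,Kolter.2008,Petersen.Pedersen.ea.2008.
Summary

Scalars, vectors, matrices, and tensors are the basic mathematical objects in linear algebra.
Vectors generalize scalars, and matrices generalize vectors.
Scalars, vectors, matrices, and tensors have zero, one, two, and an arbitrary number of axes, respectively.
A tensor can be reduced along specified axes with sum and mean.
Elementwise multiplication of two matrices is called their Hadamard product. It is different from matrix multiplication.
In deep learning, we often work with norms such as the L_1 norm, the L_2 norm, and the Frobenius norm.
We can perform a variety of operations on scalars, vectors, matrices, and tensors.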

Exercises

1. Prove that the transpose of the transpose of a matrix \mathbf{A} is \mathbf{A}, i.e., (\mathbf{A}^\top)^\top = \mathbf{A}.
Solution:
Let \mathbf{B} = \mathbf{A}^\top, so that b_{ij} = a_{ji}. Then let \mathbf{C} = \mathbf{B}^\top, i.e., \mathbf{C} = (\mathbf{A}^\top)^\top, so that c_{ij} = b_{ji} = a_{ij}. Hence (\mathbf{A}^\top)^\top = \mathbf{A}.

2. Given two matrices \mathbf{A} and \mathbf{B} (of the same shape), show that the sum of their transposes equals the transpose of their sum, i.e., \mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top.
Solution:
\because with \mathbf{C} = \mathbf{A} + \mathbf{B} we have c_{ij} = a_{ij} + b_{ij}, and therefore [\mathbf{C}^\top]_{ij} = c_{ji} = a_{ji} + b_{ji} = [\mathbf{A}^\top]_{ij} + [\mathbf{B}^\top]_{ij} for all i and j,
\therefore \mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top
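The two identities proved in exercises 1 and 2 can be spot-checked numerically. The matrices below are arbitrary examples; a check on one pair of matrices supports the proofs but does not replace them.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.tensor([[1.0, -2.0, 0.5],
                  [4.0,  0.0, 3.0]])

# Exercise 1: transposing twice gives back the original matrix.
check_1 = torch.equal(A.T.T, A)

# Exercise 2: the sum of the transposes equals the transpose of the sum.
check_2 = torch.equal(A.T + B.T, (A + B).T)

print(check_1, check_2)
```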

3. Given any square matrix \mathbf{A}, is \mathbf{A} + \mathbf{A}^\top always symmetric? Why?
Solution:
Yes. For any square matrix \mathbf{A}, \mathbf{A} + \mathbf{A}^\top is always symmetric. The proof is as follows:
\because with \mathbf{S} = \mathbf{A} + \mathbf{A}^\top we have s_{ij} = a_{ij} + a_{ji},
and [\mathbf{S}^\top]_{ij} = s_{ji} = a_{ji} + a_{ij} = s_{ij} for all i and j,
\therefore \mathbf{A} + \mathbf{A}^\top = (\mathbf{A} + \mathbf{A}^\top)^\top
That is, for any square matrix \mathbf{A}, \mathbf{A} + \mathbf{A}^\top is always symmetric.
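A quick numerical check of this result on a random square matrix (the seed and the size 4 are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)        # fixed seed so the run is repeatable
A = torch.randn(4, 4)       # an arbitrary square matrix
S = A + A.T

# s_ij = a_ij + a_ji and s_ji = a_ji + a_ij, so S equals its transpose.
is_symmetric = torch.equal(S, S.T)
print(is_symmetric)
```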

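4. A tensor X of shape (2, 3, 4) was defined in this section. What is the output of len(X)?
Solution:
The output of len(X) is 2.

5. For a tensor X of arbitrary shape, does len(X) always correspond to the length of a certain axis of X? What is that axis?
Solution:
For a tensor X of any shape with at least one axis, len(X) always corresponds to the length of axis 0.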
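The answers to exercises 4 and 5 can be checked over a few shapes. The shapes in the list are arbitrary examples; the last part shows the one case where len() does not apply, a tensor with no axes.

```python
import torch

for shape in [(4,), (5, 4), (2, 3, 4), (7, 1, 1, 2)]:
    X = torch.ones(shape)
    # len(X) agrees with the length of axis 0 for every shape tried here.
    print(shape, len(X), X.shape[0])

# A tensor with no axes (a scalar) has no axis 0, so len() is undefined for it.
try:
    len(torch.tensor(3.0))
except TypeError as err:
    print("TypeError:", err)
```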

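6. Run A/A.sum(axis=1) and see what happens. Can you analyze the reason?
Solution:
Running A/A.sum(axis=1) raises an error. A.shape is torch.Size([5, 4]) while A.sum(axis=1).shape is torch.Size([5]); aligned from the trailing axis, the sizes 4 and 5 differ and neither is 1, so the broadcasting requirements are not met.
Changing it to A/A.sum(axis=1, keepdims=True) runs successfully: A.sum(axis=1, keepdims=True).shape is torch.Size([5, 1]), so the result has the same number of axes as A, axis 1 has size 1, and axis 0 has the same size as in A, which meets the broadcasting requirements.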
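The failure and the fix described above can be reproduced as follows. The names `row_sums` and `normalized` are illustrative.

```python
import torch

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)

try:
    A / A.sum(axis=1)                    # shapes (5, 4) and (5,) do not broadcast
except RuntimeError as err:
    print("broadcast failed:", err)

row_sums = A.sum(axis=1, keepdims=True)  # shape (5, 1)
normalized = A / row_sums                # broadcasts across the 4 columns
print(normalized.shape)                  # torch.Size([5, 4])
print(normalized.sum(axis=1))            # each row now sums to 1 (up to rounding)
```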

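7. Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs along axes 0, 1, and 2?
Solution:
Summing over axis=0 gives shape (3, 4);
summing over axis=1 gives shape (2, 4);
summing over axis=2 gives shape (2, 3).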
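These three shapes can be confirmed with a short loop (the tensor of ones is an arbitrary stand-in for any tensor of shape (2, 3, 4)):

```python
import torch

X = torch.ones(2, 3, 4)
shapes = [tuple(X.sum(axis=axis).shape) for axis in range(X.ndim)]
for axis, shape in enumerate(shapes):
    print(axis, shape)
# 0 (3, 4)
# 1 (2, 4)
# 2 (2, 3)
```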

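8. Feed a tensor with 3 or more axes to the linalg.norm function and observe its output. What does this function compute for tensors of arbitrary shape?
Solution:
linalg = linear + algebra, and norm denotes the norm.
The function signature is torch.linalg.norm(A, ord=None, dim=None, keepdim=False, *, out=None, dtype=None)
where:
1) ord is the type of norm. The default is ord=None, which gives the L_2 norm for vectors and the Frobenius norm for matrices. For matrices, ord='fro' is the Frobenius norm, ord='nuc' is the nuclear norm, ord=inf is the maximum of the sums along dim=1 (max(sum(abs(A), dim=1))), ord=-inf is the minimum of the sums along dim=1 (min(sum(abs(A), dim=1))), and so on.
2) dim specifies the dimension(s) over which the norm is computed:
·if dim is an int, a vector norm is computed;
·if dim is a 2-tuple, a matrix norm is computed;
·if dim=None and ord=None, A is flattened to a one-dimensional tensor and its L_2 norm is computed;
·if dim=None and ord != None, A must be a one- or two-dimensional tensor.
3) keepdim specifies whether the dimensions given by dim are retained (with size 1).
In summary:
with all arguments at their defaults, linalg.norm returns a scalar: the tensor is flattened to one dimension and its L_2 norm is computed;
with dim specified, the output is a tensor with the dimensions in dim removed;
with dim specified and keepdim=True, the output is a tensor in which the dimensions in dim are reduced to size 1.
The experiments are as follows: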

A = torch.tensor([3.,4.])
torch.linalg.norm(A), torch.linalg.norm(A).shape
Output: (tensor(5.), torch.Size([]))
A = torch.ones(2,3,4)
torch.linalg.norm(A), torch.linalg.norm(A).shape
Output: (tensor(4.8990), torch.Size([]))
A = torch.ones(2,3,4)
torch.linalg.norm(A, dim=(1)), torch.linalg.norm(A, dim=(1)).shape
Output:
(tensor([[1.7321, 1.7321, 1.7321, 1.7321],
      [1.7321, 1.7321, 1.7321, 1.7321]]),
torch.Size([2, 4]))
A = torch.ones(2,3,4)
print(torch.linalg.norm(A, dim=(1,2)),torch.linalg.norm(A, dim=(1,2)).shape)
Output: tensor([3.4641, 3.4641]) torch.Size([2])
A = torch.ones(2,3,4)
print(torch.linalg.norm(A, dim=(1,2),keepdim=True),torch.linalg.norm(A, dim=(1,2),keepdim=True).shape)
Output: tensor([[[3.4641]],

      [[3.4641]]]) torch.Size([2, 1, 1])
The official documentation of the norm and linalg.norm functions is attached below:
help(torch.norm)
Help on function norm in module torch.functional:

norm(input, p='fro', dim=None, keepdim=False, out=None, dtype=None)
Returns the matrix norm or vector norm of a given tensor.

.. warning::

      torch.norm is deprecated and may be removed in a future PyTorch release.
      Its documentation and behavior may be incorrect, and it is no longer
      actively maintained.

      Use :func:`torch.linalg.norm`, instead, or :func:`torch.linalg.vector_norm`
      when computing vector norms and :func:`torch.linalg.matrix_norm` when
      computing matrix norms. Note, however, the signature for these functions
      is slightly different than the signature for torch.norm.

Args:
      input (Tensor): The input tensor. Its data type must be either a floating
         point or complex type. For complex inputs, the norm is calculated using the
         absolute value of each element. If the input is complex and neither
         :attr:`dtype` nor :attr:`out` is specified, the result's data type will
         be the corresponding floating point type (e.g. float if :attr:`input` is
         complexfloat).

      p (int, float, inf, -inf, 'fro', 'nuc', optional): the order of norm. Default: ``'fro'``
         The following norms can be calculated:

         ======  ==============  ==========================
         ord    matrix norm    vector norm
         ======  ==============  ==========================
         'fro' Frobenius norm  --
         'nuc' nuclear norm --
         Number  --             sum(abs(x)**ord)**(1./ord)
         ======  ==============  ==========================

         The vector norm can be calculated across any number of dimensions.
         The corresponding dimensions of :attr:`input` are flattened into
         one dimension, and the norm is calculated on the flattened
         dimension.

         Frobenius norm produces the same result as ``p=2`` in all cases
         except when :attr:`dim` is a list of three or more dims, in which
         case Frobenius norm throws an error.

         Nuclear norm can only be calculated across exactly two dimensions.

      dim (int, tuple of ints, list of ints, optional):
         Specifies which dimension or dimensions of :attr:`input` to
         calculate the norm across. If :attr:`dim` is ``None``, the norm will
         be calculated across all dimensions of :attr:`input`. If the norm
         type indicated by :attr:`p` does not support the specified number of
         dimensions, an error will occur.
      keepdim (bool, optional): whether the output tensors have :attr:`dim`
         retained or not. Ignored if :attr:`dim` = ``None`` and
         :attr:`out` = ``None``. Default: ``False``
      out (Tensor, optional): the output tensor. Ignored if
         :attr:`dim` = ``None`` and :attr:`out` = ``None``.
      dtype (:class:`torch.dtype`, optional): the desired data type of
         returned tensor. If specified, the input tensor is casted to
         :attr:`dtype` while performing the operation. Default: None.

.. note::
      Even though ``p='fro'`` supports any number of dimensions, the true
      mathematical definition of Frobenius norm only applies to tensors with
      exactly two dimensions. :func:`torch.linalg.norm` with ``ord='fro'`` aligns
      with the mathematical definition, since it can only be applied across
      exactly two dimensions.

Example::

      >>> import torch
      >>> a = torch.arange(9, dtype= torch.float) - 4
      >>> b = a.reshape((3, 3))
      >>> torch.norm(a)
      tensor(7.7460)
      >>> torch.norm(b)
      tensor(7.7460)
      >>> torch.norm(a, float('inf'))
      tensor(4.)
      >>> torch.norm(b, float('inf'))
      tensor(4.)
      >>> c = torch.tensor([[ 1, 2, 3],[-1, 1, 4]] , dtype= torch.float)
      >>> torch.norm(c, dim=0)
      tensor([1.4142, 2.2361, 5.0000])
      >>> torch.norm(c, dim=1)
      tensor([3.7417, 4.2426])
      >>> torch.norm(c, p=1, dim=1)
      tensor([6., 6.])
      >>> d = torch.arange(8, dtype= torch.float).reshape(2,2,2)
      >>> torch.norm(d, dim=(1,2))
      tensor([ 3.7417, 11.2250])
      >>> torch.norm(d[0, :, :]), torch.norm(d[1, :, :])
      (tensor(3.7417), tensor(11.2250))
help(torch.linalg.norm)
Help on built-in function linalg_norm in module torch._C._linalg:

linalg_norm(...)
linalg.norm(A, ord=None, dim=None, keepdim=False, *, out=None, dtype=None) -> Tensor

Computes a vector or matrix norm.

Supports input of float, double, cfloat and cdouble dtypes.

Whether this function computes a vector or matrix norm is determined as follows:

- If :attr:`dim` is an `int`, the vector norm will be computed.
- If :attr:`dim` is a `2`-`tuple`, the matrix norm will be computed.
- If :attr:`dim`\ `= None` and :attr:`ord`\ `= None`,
   :attr:`A` will be flattened to 1D and the `2`-norm of the resulting vector will be computed.
- If :attr:`dim`\ `= None` and :attr:`ord` `!= None`, :attr:`A` must be 1D or 2D.

:attr:`ord` defines the norm that is computed. The following norms are supported:

======================    =========================  ========================================================
:attr:`ord`             norm for matrices       norm for vectors
======================    =========================  ========================================================
`None` (default)          Frobenius norm          `2`-norm (see below)
`'fro'`                   Frobenius norm          -- not supported --
`'nuc'`                   nuclear norm             -- not supported --
`inf`                   `max(sum(abs(x), dim=1))`  `max(abs(x))`
`-inf`                   `min(sum(abs(x), dim=1))`  `min(abs(x))`
`0`                      -- not supported --       `sum(x != 0)`
`1`                      `max(sum(abs(x), dim=0))`  as below
`-1`                      `min(sum(abs(x), dim=0))`  as below
`2`                      largest singular value    as below
`-2`                      smallest singular value as below
other `int` or `float`    -- not supported --       `sum(abs(x)^{ord})^{(1 / ord)}`
======================    =========================  ========================================================

where `inf` refers to `float('inf')`, NumPy's `inf` object, or any equivalent object.

.. seealso::

         :func:`torch.linalg.vector_norm` computes a vector norm.

         :func:`torch.linalg.matrix_norm` computes a matrix norm.

         The above functions are often clearer and more flexible than using :func:`torch.linalg.norm`.
         For example, `torch.linalg.norm(A, ord=1, dim=(0, 1))` always
         computes a matrix norm, but with `torch.linalg.vector_norm(A, ord=1, dim=(0, 1))` it is possible
         to compute a vector norm over the two dimensions.

Args:
      A (Tensor): tensor of shape `(*, n)` or `(*, m, n)` where `*` is zero or more batch dimensions
      ord (int, float, inf, -inf, 'fro', 'nuc', optional): order of norm. Default: `None`
      dim (int, Tuple[int], optional): dimensions over which to compute
         the vector or matrix norm. See above for the behavior when :attr:`dim`\ `= None`.
         Default: `None`
      keepdim (bool, optional): If set to `True`, the reduced dimensions are retained
         in the result as dimensions with size one. Default: `False`

Keyword args:
      out (Tensor, optional): output tensor. Ignored if `None`. Default: `None`.
      dtype (:class:`torch.dtype`, optional): If specified, the input tensor is cast to
         :attr:`dtype` before performing the operation, and the returned tensor's type
         will be :attr:`dtype`. Default: `None`

Returns:
      A real-valued tensor, even when :attr:`A` is complex.

Examples::

      >>> from torch import linalg as LA
      >>> a = torch.arange(9, dtype=torch.float) - 4
      >>> a
      tensor([-4., -3., -2., -1.,  0.,  1.,  2.,  3.,  4.])
      >>> B = a.reshape((3, 3))
      >>> B
      tensor([[-4., -3., -2.],
            [-1.,  0.,  1.],
            [ 2.,  3.,  4.]])

      >>> LA.norm(a)
      tensor(7.7460)
      >>> LA.norm(B)
      tensor(7.7460)
      >>> LA.norm(B, 'fro')
      tensor(7.7460)
      >>> LA.norm(a, float('inf'))
      tensor(4.)
      >>> LA.norm(B, float('inf'))
      tensor(9.)
      >>> LA.norm(a, -float('inf'))
      tensor(0.)
      >>> LA.norm(B, -float('inf'))
      tensor(2.)

      >>> LA.norm(a, 1)
      tensor(20.)
      >>> LA.norm(B, 1)
      tensor(7.)
      >>> LA.norm(a, -1)
      tensor(0.)
      >>> LA.norm(B, -1)
      tensor(6.)
      >>> LA.norm(a, 2)
      tensor(7.7460)
      >>> LA.norm(B, 2)
      tensor(7.3485)

      >>> LA.norm(a, -2)
      tensor(0.)
      >>> LA.norm(B.double(), -2)
      tensor(1.8570e-16, dtype=torch.float64)
      >>> LA.norm(a, 3)
      tensor(5.8480)
      >>> LA.norm(a, -3)
      tensor(0.)

Using the :attr:`dim` argument to compute vector norms::

      >>> c = torch.tensor([[1., 2., 3.],
      ...                [-1, 1, 4]])
      >>> LA.norm(c, dim=0)
      tensor([1.4142, 2.2361, 5.0000])
      >>> LA.norm(c, dim=1)
      tensor([3.7417, 4.2426])
      >>> LA.norm(c, ord=1, dim=1)
      tensor([6., 6.])

Using the :attr:`dim` argument to compute matrix norms::

      >>> A = torch.arange(8, dtype=torch.float).reshape(2, 2, 2)
      >>> LA.norm(A, dim=(1,2))
      tensor([ 3.7417, 11.2250])
      >>> LA.norm(A[0, :, :]), LA.norm(A[1, :, :])
      (tensor(3.7417), tensor(11.2250))
