

news 2025/11/16 6:38:08

1. Usage

In Megatron, passing --use-distributed-optimizer enables the distributed optimizer. The flag is defined in megatron/arguments.py. The idea is to shard the optimizer state evenly across the data-parallel ranks, which is equivalent to training with ZeRO-1.

    group.add_argument('--use-distributed-optimizer', action='store_true',
                       help='Use distributed optimizer.')

When --use-distributed-optimizer is set, two other arguments are checked: args.DDP_impl must be 'local' (the default) and args.use_contiguous_buffers_in_local_ddp must be on (also the default).

    # If we use the distributed optimizer, we need to have local DDP
    # and we should make sure use-contiguous-buffers-in-local-ddp is on.
    if args.use_distributed_optimizer:
        assert args.DDP_impl == 'local'
        assert args.use_contiguous_buffers_in_local_ddp

The theoretical memory saved by the distributed optimizer depends on the parameter dtype and the gradient dtype. The table below gives the theoretical number of bytes used per parameter. Here d is the data-parallel size, that is, the number of GPUs in one data-parallel group, which equals the total number of GPUs divided by $TP \times PP$.

| Training data types | Non-distributed optim (bytes) | Distributed optim (bytes) |
| --- | --- | --- |
| float16 param, float16 grads | 20 | 4 + 16/d |
| float16 param, fp32 grads | 18 | 6 + 12/d |
| fp32 param, fp32 grads | 16 | 8 + 8/d |
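The per-parameter byte counts in the table can be turned into a small calculator to see how the saving grows with d. This is only a sketch: the function name `bytes_per_param` and the dtype labels are invented for illustration and are not part of Megatron. The constants are the ones from the table.

```python
# Hypothetical helper (not a Megatron API): theoretical bytes per parameter,
# using the constants from the table in section 1.
BYTES_TABLE = {
    # (param dtype, grad dtype): (non-distributed, fixed part, sharded part)
    ("fp16", "fp16"): (20, 4, 16),
    ("fp16", "fp32"): (18, 6, 12),
    ("fp32", "fp32"): (16, 8, 8),
}


def bytes_per_param(param_dtype, grad_dtype, d, distributed=True):
    """Return the theoretical bytes per parameter for data-parallel size d."""
    if d < 1:
        raise ValueError("data-parallel size d must be >= 1")
    non_distributed, fixed, sharded = BYTES_TABLE[(param_dtype, grad_dtype)]
    if not distributed:
        return float(non_distributed)
    # Only the sharded part shrinks as the data-parallel size grows.
    return fixed + sharded / d


# With d = 1 the distributed formula matches the non-distributed column.
for key in BYTES_TABLE:
    assert bytes_per_param(*key, d=1) == bytes_per_param(*key, d=1, distributed=False)

print(bytes_per_param("fp16", "fp16", d=8))  # 4 + 16/8 = 6.0
```

With d = 1 nothing is sharded, so both columns of the table agree. As d grows, the cost approaches the fixed part (4, 6 or 8 bytes).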
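2. Implementation overview

The distributed optimizer is built around a contiguous grad buffer, which is used to communicate parameters and gradients between the model state and the optimizer state. Communication on the grad buffer uses reduce-scatter and all-gather. The data flow is as follows:

- Each DP rank computes its gradients, which form the grad buffer to be updated.
- At update time, a reduce-scatter shards the grad buffer across the ranks.
- Each rank runs the optimizer step() on its own shard.
- An all-gather over all results produces the updated grad buffer.

Taking fp16 gradients as an example, the grad buffer is sharded as follows. There are 4 parameters, shown in green, yellow, blue and red, with a total size of 16 fp16 elements. The data is split evenly by the number of DP ranks. A large parameter may be only partly held by a given rank, so offsets have to be tracked. For each param on each DP rank there are three offsets: world_index is the offset in the whole buffer, local_index is the offset within the current rank's shard, and param_index is the offset, relative to the param itself, of the data stored on the current rank. All ranges below are half-open (start inclusive, end exclusive).

Take the yellow parameter Param1 as an example. rank0 stores 1 element of Param1 and rank1 stores 4 elements.

- world_index: the yellow element on rank0 is [3,4] of the whole buffer, and the 4 yellow elements on rank1 are [4,8].
- local_index: on rank0 it is [3,4]. On rank1 it covers all 4 elements of the shard, that is, [0,4].
- param_index: on rank0 the param_index of Param1 is [0,1], and on rank1 it is [1,5].

Key steps in detail

Each square in the figure represents one fp16 element of a grad buffer.

- After the backward pass, the grad buffer holds 16 fp16 elements.
- reduce-scatter is called on every DP rank. Each DP rank's grad buffer now has 4 fp16 elements updated by the reduce-scatter. The other 12 fp16 elements are not updated and are treated as garbage.
- Each DP rank copies its 4 updated fp16 elements from the grad buffer into the fp32 main grad buffer to prepare for the update. For example, DP rank0 copies elements [0:4], DP rank1 copies [4:8], DP rank2 copies [8:12], and DP rank3 copies [12:16].
- Optimizer.step() runs. The step() computation must be done in fp32.
- Each DP rank copies the 4 fp32 elements updated by step() from the main grad buffer back into the fp16 grad buffer.
- An all-gather runs, so that every grad buffer holds the latest updated data.
- The fp16 parameters of each model are updated from the grad buffer.
- The next iteration begins.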
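The three index ranges from the Param1 example in section 2 can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: `shard_ranges` is a hypothetical helper, not a Megatron function. The sizes of Param0 (3) and Param1 (5) follow from the ranges quoted in the text, while the sizes of Param2 and Param3 (4 and 4) are assumed so that the total is 16.

```python
# Hypothetical helper (not a Megatron API): compute world/local/param index
# ranges for every (rank, param) pair of an evenly sharded flat buffer.
def shard_ranges(param_sizes, dp_world_size):
    total = sum(param_sizes)
    assert total % dp_world_size == 0, "buffer must split evenly across ranks"
    shard_size = total // dp_world_size

    ranges = {}
    param_start = 0
    for param_id, size in enumerate(param_sizes):
        param_end = param_start + size
        for rank in range(dp_world_size):
            shard_start = rank * shard_size
            shard_end = shard_start + shard_size
            # Overlap between this param and this rank's shard (half-open).
            lo = max(param_start, shard_start)
            hi = min(param_end, shard_end)
            if lo < hi:
                ranges[(rank, param_id)] = {
                    "world": (lo, hi),
                    "local": (lo - shard_start, hi - shard_start),
                    "param": (lo - param_start, hi - param_start),
                }
        param_start = param_end
    return ranges


# Param sizes: 3 and 5 come from the text; 4 and 4 are assumptions.
example = shard_ranges([3, 5, 4, 4], dp_world_size=4)
print(example[(0, 1)])  # {'world': (3, 4), 'local': (3, 4), 'param': (0, 1)}
print(example[(1, 1)])  # {'world': (4, 8), 'local': (0, 4), 'param': (1, 5)}
```

The second line of output shows that the param_index range [1,5] for Param1 belongs to rank1, the rank that holds its remaining 4 elements.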
3. Source code

3.1 Entry points

Initialization starts in the get_model function in megatron/training.py. When the LocalDDP instances are created, args.use_contiguous_buffers_in_local_ddp is passed in.

    from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP

    def get_model(model_provider_func, model_type=ModelType.encoder_or_decoder, wrap_with_ddp=True):
        ...
        if wrap_with_ddp:
            if args.DDP_impl == 'torch':
                ...
            elif args.DDP_impl == 'local':
                model = [LocalDDP(model_module,
                                  args.accumulate_allreduce_grads_in_fp32,
                                  args.use_contiguous_buffers_in_local_ddp)
                         for model_module in model]
            ...

The training entry point is the train_step function. Its basic flow is:

    def train_step(forward_step_func, data_iterator,
                   model, optimizer, opt_param_scheduler):
        ...
        # Clear the gradients.
        if args.DDP_impl == 'local' and args.use_contiguous_buffers_in_local_ddp:
            for partition in model:
                partition.zero_grad_buffer()
        optimizer.zero_grad()
        ...
        # Run the forward and backward passes.
        losses_reduced = forward_backward_func(...)
        ...
        # Reduce-scatter the gradients.
        optimizer.reduce_model_grads(args, timers)
        ...
        # Update the parameters.
        timers('optimizer', log_level=1).start(barrier=args.barrier_with_L1_time)
        update_successful, grad_norm, num_zeros_in_grad = optimizer.step(args, timers)
        timers('optimizer').stop()
        ...
        # All-gather the updated params.
        if update_successful:
            optimizer.gather_model_params(args, timers)
        ...
        # Update the learning rate through the scheduler.
        if update_successful:
            increment = get_num_microbatches() * \
                        args.micro_batch_size * \
                        args.data_parallel_size
            opt_param_scheduler.step(increment=increment)
            skipped_iter = 0
        else:
            skipped_iter = 1
        ...

3.2 grad buffer initialization (DistributedDataParallel class)

The grad buffer is initialized in the __init__ method of the DistributedDataParallel class, which is defined in megatron/model/distributed.py.

    class DistributedDataParallel(DistributedDataParallelBase):

        def __init__(self, module,
                     accumulate_allreduce_grads_in_fp32,
                     use_contiguous_buffers):

Create the grad buffers and the index map:

            self._grad_buffers = {}
            self._grad_buffer_param_index_map = {}
            data_parallel_world_size = mpu.get_data_parallel_world_size()

Count the number of elements per dtype. The counts are stored in the type_num_elements map, where the key is the element dtype and the value is the number of elements of that dtype:

            # First calculate total number of elements per type.
            type_num_elements = {}
            for param in self.module.parameters():
                if param.requires_grad:
                    dtype = _get_buffer_type(param)
                    type_num_elements[dtype] = type_num_elements.get(dtype, 0) \
                                               + param.data.nelement()

Allocate the grad buffers. So that each buffer can be split evenly across the DP ranks, the element count for each dtype is first padded, and the storage is then allocated through MemoryBuffer:

            # Allocate the buffer.
            for dtype, num_elements in type_num_elements.items():
                # If using distributed optimizer, pad memory buffer to be
                # multiple of data_parallel_world_size. (This padding is done
                # due to a constraint with the reduce_scatter op, which requires
                # all tensors have equal size. See: optimizer.py.)
                num_elements_padded = data_parallel_world_size * \
                    int(math.ceil(num_elements / data_parallel_world_size))

                # Allocate grad buffer.
                self._grad_buffers[dtype] = MemoryBuffer(num_elements,
                                                         num_elements_padded,
                                                         dtype)

Give each param its main_grad from the grad buffer. For each param, a tensor with the same shape as param.data is taken from self._grad_buffers[dtype] for that param's dtype. This tensor shares storage with the grad buffer. The grad buffer is filled in reverse order: if self.module.parameters() yields three parameters [p1, p2, p3], the grad buffer stores them as [p3_grad, p2_grad, p1_grad]. _grad_buffer_param_index_map records the start and end position of each param's gradient in the grad buffer.

            ...
            # Assume the back prop order is reverse the params order,
            # store the start index for the gradients.
            for param in self.module.parameters():
                if param.requires_grad:
                    dtype = _get_buffer_type(param)
                    type_num_elements[dtype] -= param.data.nelement()
                    # The second argument of get() is start_index, which
                    # counts down from the end of the grad buffer.
                    param.main_grad = self._grad_buffers[dtype].get(
                        param.data.shape, type_num_elements[dtype])
                    if dtype not in self._grad_buffer_param_index_map:
                        self._grad_buffer_param_index_map[dtype] = {}
                    self._grad_buffer_param_index_map[dtype][param] = (
                        type_num_elements[dtype],
                        type_num_elements[dtype] + param.data.nelement(),
                    )

Finally, loop over every parameter and hook its gradient accumulator, which is the next function after the parameter's grad_fn. A param has no grad_fn of its own, so the code uses a trick: calling param.expand_as(param) yields a tensor that does have a grad_fn.

            ...
            # Backward hook.
            # Accumulation function for the gradients. We need
            # to store them so they don't go out of scope.
            self.grad_accs = []
            # Loop over all the parameters in the model.
            for param in self.module.parameters():
                if param.requires_grad:
                    # Use expand_as so that param gets a grad_fn.
                    param_tmp = param.expand_as(param)
                    # Get the gradient accumulator function and register the hook on it.
                    grad_acc = param_tmp.grad_fn.next_functions[0][0]
                    grad_acc.register_hook(self._make_param_hook(param))
                    self.grad_accs.append(grad_acc)

        def _make_param_hook(self, param):
            """Create the all-reduce hook for backprop."""
            # Hook used for back-prop.
            def param_hook(*unused):
                # Add the gradient to the buffer.
                if param.grad is not None:
                    # The gradient function of linear layers is fused with GEMMs
                    param.main_grad.add_(param.grad.data)
                    # Now we can deallocate grad memory.
                    param.grad = None
            return param_hook

4. References

Megatron-LM源码系列(六)Distributed-Optimizer分布式优化器实现Part1
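The padding and reverse-order layout of the grad buffer described in section 3.2 can be checked without PyTorch. The sketch below mirrors only the index arithmetic. `build_index_map` and the parameter names p1, p2, p3 are hypothetical, and a plain dict of sizes stands in for self.module.parameters() and MemoryBuffer.

```python
import math


# Hypothetical helper (not a Megatron API): reproduce the padded size and the
# (start, end) range of each param's gradient in a single-dtype grad buffer.
def build_index_map(param_sizes, data_parallel_world_size):
    num_elements = sum(param_sizes.values())
    # Pad so the buffer divides evenly across data-parallel ranks.
    num_elements_padded = data_parallel_world_size * int(
        math.ceil(num_elements / data_parallel_world_size))

    index_map = {}
    remaining = num_elements
    # Walk the params in forward order while counting down from the end,
    # so the first param lands at the back of the buffer.
    for name, size in param_sizes.items():
        remaining -= size
        index_map[name] = (remaining, remaining + size)
    return num_elements_padded, index_map


padded, index_map = build_index_map({"p1": 3, "p2": 5, "p3": 2},
                                    data_parallel_world_size=4)
print(padded)     # 12
print(index_map)  # {'p1': (7, 10), 'p2': (2, 7), 'p3': (0, 2)}
```

Sorting the ranges by start index gives p3, p2, p1, which is the [p3_grad, p2_grad, p1_grad] order. The 10 real elements are padded to 12 so that each of the 4 ranks owns exactly 3.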
