Adding profiler support to your own PyTorch extension backend

This article demonstrates how to add profiler support to your own PyTorch extension backend.

Background

1. The device has no profiling facility comparable to CNLight, the Profiling AscendCL API, or ROC Tracer, so runtime, driver, and kernel activity cannot be traced and device metrics cannot be collected.
2. Only an event API is available, which can measure kernel elapsed time.
3. This article is only an experiment; the approach is not a sound design.
4. torch's native profiler framework depends on kineto, which currently supports CUPTI and ROC Tracer. Without modifying the torch source, it is inconvenient for third-party devices to use.
5. Huawei, Cambricon, and Habana all expose an interface modeled on torch.profile and use at::addThreadLocalCallback, but do not depend on the torch.profiler framework. Their raw profiling data is in private formats, and they modify the TensorBoard plugin for visualization.

Implementation steps

1. Call torch::profiler::impl::registerPrivateUse1Methods to register the backend's profiler stubs.
2. There is no correlation ID to link host API calls to kernels, so the data written by export_chrome_trace contains no kernel information.
3. Read the data in prof.profiler.function_events and build a uuid from `{ev.name}_{ev.id}_{ev.thread}` to match each entry against the events in the chrome trace above.
4. There is only one stream, so the kernel start and end times can be inferred from the host launch time, the kernel duration, and an assumed (prior) launch latency, and linked to the host side with flow events. The result is not accurate.
5. Finally, append the kernel events and flow events to the chrome trace.

1. References

- ROC Tracer
- CUPTI
- Huawei profiler_npu
- Profiling AscendCL API
- Cambricon profile_mlu
- Cambricon CNLight
- habana torch
- intel_extension_for_pytorch
- Make the kineto extendable for other runtime than CUD
- pytorch_open_registration_example
- rename_privateuse1_backend
- Trace Event Format

2. Code to add to your-extension-for-pytorch

    #include <functional>
    #include <memory>
    #include <c10/util/irange.h>
    #include <torch/csrc/profiler/stubs/base.h>
    #include <torch/csrc/profiler/util.h>
    // The xpurt*, c10::xpu and vastai symbols come from the vendor runtime;
    // their headers are not shown in the original article.

    using torch::profiler::impl::ProfilerStubs;
    using torch::profiler::impl::ProfilerVoidEventStub;

    namespace torch {
    namespace profiler {
    namespace impl {

    struct NPUMethods : public ProfilerStubs {
      // Record a device event on the current stream and, optionally, the host time.
      void record(int* device, ProfilerVoidEventStub* event, int64_t* cpu_ns) const override {
        if (device) {
          TORCH_CHECK(xpurtGetDevice((uint32_t*)device));
        }
        xpurtEvent_t xpurt_event;
        TORCH_CHECK(xpurtEventCreate(&xpurt_event));
        // The shared_ptr deleter destroys the event when the profiler releases it.
        *event = std::shared_ptr<void>(xpurt_event, [](xpurtEvent_t ptr) {
          TORCH_CHECK(xpurtEventDestroy(ptr));
        });
        auto xpurt_stream = c10::xpu::getCurrentxpuStream(vastai::get_device());
        if (cpu_ns) {
          *cpu_ns = getTime();
        }
        TORCH_CHECK(xpurtEventRecord(xpurt_event, xpurt_stream));
      }

      // Elapsed device time between two recorded events.
      float elapsed(const ProfilerVoidEventStub* event1_,
                    const ProfilerVoidEventStub* event2_) const override {
        auto event1 = static_cast<xpurtEvent_t>(event1_->get());
        TORCH_CHECK(xpurtEventSynchronize(event1));
        auto event2 = static_cast<xpurtEvent_t>(event2_->get());
        TORCH_CHECK(xpurtEventSynchronize(event2));
        int64_t time_ms = 0;
        TORCH_CHECK(xpurtEventElapsedTime(&time_ms, event1, event2));
        // NOTE: torch's CUDA stub returns microseconds here; rescale if
        // xpurtEventElapsedTime reports another unit.
        return time_ms * 1.0;
      }

      void onEachDevice(std::function<void(int)> op) const override {
        uint32_t device = 0;
        TORCH_CHECK(xpurtGetDevice(&device));
        op(device);
      }

      void synchronize() const override {}

      bool enabled() const override {
        return true;
      }

      void mark(const char* name) const override {}

      void rangePush(const char* name) const override {}

      void rangePop() const override {}
    };

    // Registers the stubs once, when the extension library is loaded.
    struct RegisterNPUMethods {
      RegisterNPUMethods() {
        static NPUMethods methods;
        torch::profiler::impl::registerPrivateUse1Methods(&methods);
      }
    };
    RegisterNPUMethods reg;

    } // namespace impl
    } // namespace profiler
    } // namespace torch

3. PyTorch demo and how to adjust the chrome trace JSON file

    import copy
    import json
    import math
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    import tqdm
    from torch import nn
    from torch.profiler import profile


    def is_valid_kernel(name, duration, valid_kernel_threshold=100):
        '''Decide from the operator name and duration whether it is a device kernel.'''
        invalid_kernels = ["aten::view", "aten::reshape", "aten::t", "aten::empty",
                           "aten::transpose", "aten::as_strided", "aten::item",
                           "aten::_local_scalar_dense", "aten::result_type",
                           "aten::_unsafe_view", "aten::expand"]
        for k in invalid_kernels:
            # Substring match, as in the original.
            if name.find(k) >= 0:
                return False
        if duration < valid_kernel_threshold:
            return False
        return True


    def filter_ev(ev):
        '''Keep only trace events that carry an External id.'''
        if 'args' in ev and "External id" in ev['args']:
            return True
        return False


    def get_uuid(ev, tid_map):
        return f"{ev['name']}_{ev['args']['External id']}_{tid_map[ev['tid']]}"


    def get_valid_kernels(traceEvents, kernel_event, tid_map):
        valid_kernels = []
        device_memory_usage = 0
        for ev in traceEvents:
            if filter_ev(ev):
                uuid = get_uuid(ev, tid_map)
                if uuid not in kernel_event:
                    continue
                duration = kernel_event[uuid]['kernel_time']
                kernel_name = ev["name"]
                if kernel_event[uuid]['device_memory_usage'] > 0:
                    device_memory_usage = kernel_event[uuid]['device_memory_usage']
                if is_valid_kernel(kernel_name, duration):
                    launch_beg = ev['ts']
                    launch_end = ev['ts'] + ev['dur']
                    valid_kernels.append({"name": kernel_name,
                                          "launch_beg": launch_beg,
                                          "launch_end": launch_end,
                                          "kernel_duration": duration,
                                          "host_pid": ev['pid'],
                                          "host_tid": ev['tid'],
                                          "device_memory_usage": device_memory_usage,
                                          "is_leaf_kernel": False})
        return sorted(valid_kernels, key=lambda x: x['launch_beg'])


    def is_leaf_kernel(kernel, valid_kernels):
        '''Decide whether this is a leaf kernel.'''
        ret = True
        for k in valid_kernels:
            if k['is_leaf_kernel']:
                continue
            # Another kernel lies inside this kernel's own time span.
            if k['launch_beg'] > kernel['launch_beg'] and k['launch_end'] < kernel['launch_end']:
                ret = False
                break
        return ret


    def create_tid_map(traceEvents):
        # Map raw thread ids in the trace to 1-based indices, matching ev.thread.
        tids = set()
        for ev in traceEvents:
            if filter_ev(ev):
                tids.add(ev['tid'])
        tid_map = {}
        tids = sorted(tids, reverse=False)
        for i, v in enumerate(tids):
            tid_map[v] = i + 1
        return tid_map


    def merge_prof_timeline(prof_json, kernel_event_json, output_json):
        kernel_launch_latency = 0
        with open(prof_json, "r", encoding="utf-8") as f:
            prof = json.load(f)
        with open(kernel_event_json, "r", encoding="utf-8") as f:
            kernel_event = json.load(f)

        traceEvents = prof['traceEvents']
        tid_map = create_tid_map(traceEvents)
        print(tid_map)
        # Collect all kernels.
        valid_kernels = get_valid_kernels(traceEvents, kernel_event, tid_map)
        print(len(valid_kernels))
        # Keep only the kernels that actually execute on the device.
        on_device_kernels = []
        for kernel in tqdm.tqdm(valid_kernels):
            if is_leaf_kernel(kernel, valid_kernels):
                on_device_kernels.append(kernel)

        kernel_start_offset = 0
        kernel_index = 0
        for kernel in on_device_kernels:
            name = kernel['name']
            kernel_duration = kernel["kernel_duration"]
            launch_time = kernel["launch_beg"]
            host_pid = kernel['host_pid']
            host_tid = kernel['host_tid']
            device_memory_usage = kernel['device_memory_usage']
            if kernel_start_offset == 0:
                kernel_start_offset = launch_time
            if launch_time > kernel_start_offset:  # the kernel queue is idle
                kernel_start_offset = launch_time
            # Add the kernel event.
            traceEvents.append({"ph": "X", "cat": "device_kernel", "name": name,
                                "pid": 10, "tid": 10,
                                "ts": kernel_start_offset, "dur": kernel_duration})
            # Add the memory counter event.
            traceEvents.append({"ph": "C", "cat": "memory", "name": "memory",
                                "pid": 11, "tid": 11,
                                "ts": launch_time, "args": {"value": device_memory_usage}})
            # Add the flow events (host launch -> device kernel).
            traceEvents.append({"ph": "s", "id": kernel_index, "pid": host_pid, "tid": host_tid,
                                "ts": launch_time, "cat": "ac2g", "name": "ac2g"})
            traceEvents.append({"ph": "f", "id": kernel_index, "pid": 10, "tid": 10,
                                "ts": kernel_start_offset, "cat": "ac2g", "name": "ac2g", "bp": "e"})
            kernel_index += 1
            kernel_start_offset += (kernel_duration + kernel_launch_latency)
        # Save the final result.
        with open(output_json, "w", encoding="utf-8") as f:
            json.dump(prof, f, ensure_ascii=False, indent=4)


    def clones(module, N):
        return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


    class ScaledDotProductAttention(nn.Module):
        def __init__(self):
            super(ScaledDotProductAttention, self).__init__()

        def forward(self, query, key, value, mask=None, dropout=None):
            d_k = query.size(-1)
            scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
            if mask is not None:
                scores = scores.masked_fill(mask == 0, -1e20)
            p_attn = F.softmax(scores, dim=-1)
            if dropout is not None:
                p_attn = dropout(p_attn)
            return p_attn @ value, p_attn


    class MultiHeadAttention(nn.Module):
        def __init__(self, h, d_model, dropout=0.1):
            super(MultiHeadAttention, self).__init__()
            assert d_model % h == 0
            self.d_k = d_model // h
            self.h = h
            self.linears = clones(nn.Linear(d_model, d_model), 4)
            self.attn = None
            self.dropout = nn.Dropout(p=dropout)
            self.attention = ScaledDotProductAttention()

        def forward(self, query, key, value, mask=None):
            if mask is not None:
                mask = mask.unsqueeze(1)
            nbatches = query.size(0)
            query = self.linears[0](query).view(nbatches, -1, self.h, self.d_k)
            query = query.transpose(1, 2)
            key = self.linears[1](key).view(nbatches, -1, self.h, self.d_k)
            key = key.transpose(1, 2)
            value = self.linears[2](value).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            x, self.attn = self.attention(query, key, value, mask=mask, dropout=self.dropout)
            x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
            return self.linears[-1](x)


    # Fall back to CUDA when the vendor extension is not installed.
    use_cuda = True
    try:
        import torch_xpu
        import torch_xpu.contrib.transfer_to_xpu
        torch.xpu.set_device(0)
        torch.profiler.ProfilerActivity.PrivateUse1 = "xpu"
        use_cuda = False
    except Exception:
        pass

    os.environ['LOCAL_RANK'] = '0'
    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '1'
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '6006'

    dist.init_process_group(backend='vccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    if not dist.is_available() or not dist.is_initialized():
        print("dist init error")

    cross_attn = MultiHeadAttention(h=8, d_model=64).half().cuda()
    cross_attn.eval()
    q1 = torch.ones((1, 50, 64), dtype=torch.float32).half().cuda()
    k1 = q1.clone()
    v1 = q1.clone()
    out = cross_attn.forward(q1, k1, v1).sum()
    torch.cuda.synchronize()

    activities = [torch.profiler.ProfilerActivity.CPU]
    if use_cuda:
        activities.append(torch.profiler.ProfilerActivity.CUDA)

    with profile(
        activities=activities,
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        record_shapes=True,
        with_stack=True,
        with_modules=True,
        with_flops=True,
        profile_memory=True,
    ) as prof:
        for i in range(10):
            out = cross_attn.forward(q1, k1, v1).sum()
            prof.step()
        torch.cuda.synchronize()

    if not use_cuda:
        # Dump per-op device time and memory keyed by uuid, then merge into the trace.
        kernel_event = {}
        for ev in prof.profiler.function_events:
            if ev.privateuse1_time > 0:
                uuid = f"{ev.name}_{ev.id}_{ev.thread}"
                # print(uuid, ev.id, ev.name, ev.privateuse1_time, ev.time_range.start,
                #       ev.time_range.end - ev.time_range.start, ev.privateuse1_memory_usage)
                kernel_event[uuid] = {"kernel_time": ev.privateuse1_time,
                                      "device_memory_usage": ev.privateuse1_memory_usage,
                                      "start_us": ev.time_range.start,
                                      "host_dur": ev.time_range.end - ev.time_range.start,
                                      "thread": ev.thread}
        with open(f"kernel_event_{rank}.json", "w", encoding="utf-8") as f:
            json.dump(kernel_event, f, ensure_ascii=False, indent=4)
        prof.export_chrome_trace(f"prof_{rank}.json")
        merge_prof_timeline(f"prof_{rank}.json", f"kernel_event_{rank}.json", f"prof_{rank}.json")
    else:
        # print(prof.key_averages().table(sort_by="self_cpu_time_total"))
        prof.export_chrome_trace(f"prof_{q1.device.type}.json")

4. [Visualization](https://ui.perfetto.dev/)

Open the exported trace JSON in the Perfetto UI to view the timeline.
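The single-stream inference described in implementation step 4 can be tried in isolation, without a device or a trace file. The following is a minimal sketch rather than the article's code: `infer_kernel_timeline` and its `(launch_ts, duration)` tuple input are hypothetical names chosen for illustration, and all times are assumed to be in microseconds. It assumes that a kernel starts as soon as it has been launched and the stream is free, which is only an approximation.

```python
def infer_kernel_timeline(launches, launch_latency=0):
    """Infer (start, end) for each kernel on a single in-order stream.

    launches: list of (launch_ts, duration) tuples sorted by launch_ts.
    launch_latency: assumed gap added after each kernel (a prior, not measured).
    """
    timeline = []
    next_free = None  # earliest time the stream can accept the next kernel
    for launch_ts, duration in launches:
        if next_free is None or launch_ts > next_free:
            # The stream is idle, so the kernel starts at its launch time.
            start = launch_ts
        else:
            # The stream is busy, so the kernel queues behind the previous one.
            start = next_free
        end = start + duration
        timeline.append((start, end))
        next_free = end + launch_latency
    return timeline


# Hypothetical launch records: (host launch timestamp, kernel duration).
launches = [(100, 50), (110, 30), (400, 20)]
timeline = infer_kernel_timeline(launches)
print(timeline)
```

The second kernel is launched at 110 while the first is still running, so it is placed at 150; the third is launched after the stream has gone idle, so it starts at its own launch time. A nonzero `launch_latency` simply pushes queued kernels further apart.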