当前位置：首页 > news >正文

贝壳找房网站做销售steam交易链接在哪里

news 2025/11/13 18:31:43

贝壳找房网站做销售,steam交易链接在哪里,文章列表插件wordpress,百度做网站需要交钱吗note GLM将针对不同类型下游任务的预训练目标统一为了自回归填空#xff0c;结合了混合的注意力机制和新的二维位置编码。本文浅析sft#xff0c;并基于GLM在广告描述数据集上进行sftp-tuning代码的数据流讲解文章目录 note零、ChatGLM2模型一、Supervised fine-tuning1. 数…note GLM将针对不同类型下游任务的预训练目标统一为了自回归填空结合了混合的注意力机制和新的二维位置编码。本文浅析sft并基于GLM在广告描述数据集上进行sftp-tuning代码的数据流讲解文章目录 note零、ChatGLM2模型一、Supervised fine-tuning1. 数据样本的准备2. stanford_alpaca的dataset类3. 数据格式、计算loss的数据二、P-Tuning v2的数据流1. 数据准备2. P-Tuning v2微调3. 模型推理三、垂直领域训练注意事项Reference 零、ChatGLM2模型后面对chatglm2进行sft微调这里顺带着先介绍下glm2 chatglm-6bhttps://github.com/THUDM/ChatGLM-6Bchatglm2-6bhttps://github.com/THUDM/ChatGLM2-6Bchatglm130https://github.com/THUDM/GLM-130B 自回归空格填充任务初始文本输入x1, x2,x3,x4,x5,x6随机掩码mask PartA 部分x1,x2,M,x4,M 其中M表示mask的跨度PartB 部分 S,x5,x6,S,x3刚才mask掉的那几坨随机排序后对起始位置加入token拼接PartA和PartB 自注意机制chatglm在Q、K中加入了RoPE位置信息 Q W q X K W k X V W v X Attention ⁡ ( Q , K , V , A ) softmax ⁡ ( Q K T d k ) V \begin{aligned} Q W_q X \\ K W_k X \\ V W_v X \\ \operatorname{Attention}(Q, K, V, A) \operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \end{aligned} QKVAttention(Q,K,V,A)WqXWkXWvXsoftmax(dk QKT)Vmaskchatglm6b使用prefix-LM的mask对于输入的前缀使用双向注意力对于后续的生成部分则是causal mask PartA部分内的各token可以互相注意到PartB部分内的tokens可以注意到PartA和PartB中已经生成的token def get_masks(self, input_ids, past_key_values, padding_maskNone):batch_size, seq_length input_ids.shapefull_attention_mask torch.ones(batch_size, seq_length, seq_length, deviceinput_ids.device)full_attention_mask.tril_()past_length 0if past_key_values:past_length past_key_values[0][0].shape[0]if past_length:full_attention_mask torch.cat((torch.ones(batch_size, seq_length, past_length,deviceinput_ids.device), full_attention_mask), dim-1)if padding_mask is not None:full_attention_mask full_attention_mask * padding_mask.unsqueeze(1)if not past_length and padding_mask is not None:full_attention_mask - padding_mask.unsqueeze(-1) - 1full_attention_mask (full_attention_mask 0.5).bool()full_attention_mask.unsqueeze_(1)return full_attention_mask一、Supervised fine-tuning sft就是下面的第一个环节使用指令数据做有监督精调 (supervised fine-tuning)。 1. 数据样本的准备参考stanford_alpaca的sft 整理的数据有三列instruction、input、output。Instruction和input通过prompt组搞在一起为sourseoutput换为target把source和target和token.eos_token_id直接拼接在一起这个时候暂时叫sentence。然后把sentence通过tokenizer转换成input_ids。最后一步要把input_ids复制一份叫labels。然后把labels前面的位置即source对应的tokenid全部变成-100。那么这个时候一个面向sft任务的input_ids和labels就已经构造好了。在这个任务里面使用的就是transformers的DataCollatorForSeq2Seq。这个data_collator任务很简单就是让每一个batch内的input_ids和labels都长度对齐。 def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) - Dict:Tokenize a list of strings.tokenized_list [tokenizer(text,return_tensorspt,paddinglongest,max_lengthtokenizer.model_max_length,truncationTrue,)for text in strings]input_ids labels [tokenized.input_ids[0] for tokenized in tokenized_list]input_ids_lens labels_lens [tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list]return dict(input_idsinput_ids,labelslabels,input_ids_lensinput_ids_lens,labels_lenslabels_lens,)def preprocess(sources: Sequence[str],targets: Sequence[str],tokenizer: transformers.PreTrainedTokenizer, ) - Dict:Preprocess the data by tokenizing.examples [s t for s, t in zip(sources, targets)]examples_tokenized, sources_tokenized [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]input_ids examples_tokenized[input_ids]labels copy.deepcopy(input_ids)for label, source_len in zip(labels, sources_tokenized[input_ids_lens]):label[:source_len] IGNORE_INDEXreturn dict(input_idsinput_ids, labelslabels)2. stanford_alpaca的dataset类 class SupervisedDataset(Dataset):Dataset for supervised fine-tuning.def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):super(SupervisedDataset, self).__init__()logging.warning(Loading data...)list_data_dict utils.jload(data_path)logging.warning(Formatting inputs...)prompt_input, prompt_no_input PROMPT_DICT[prompt_input], PROMPT_DICT[prompt_no_input]sources [prompt_input.format_map(example) if example.get(input, ) ! else prompt_no_input.format_map(example)for example in list_data_dict]targets [f{example[output]}{tokenizer.eos_token} for example in list_data_dict]logging.warning(Tokenizing inputs... This may take some time...)data_dict preprocess(sources, targets, tokenizer)self.input_ids data_dict[input_ids]self.labels data_dict[labels]def __len__(self):return len(self.input_ids)def __getitem__(self, i) - Dict[str, torch.Tensor]:return dict(input_idsself.input_ids[i], labelsself.labels[i])3. 数据格式、计算loss的数据简单分析hf的trainerhugggingface自带的trainer类中参数如下 def __init__(self,model: Union[PreTrainedModel, nn.Module] None,args: TrainingArguments None,data_collator: Optional[DataCollator] None,train_dataset: Optional[Dataset] None,eval_dataset: Optional[Union[Dataset, Dict[str, Dataset]]] None,tokenizer: Optional[PreTrainedTokenizerBase] None,model_init: Optional[Callable[[], PreTrainedModel]] None,compute_metrics: Optional[Callable[[EvalPrediction], Dict]] None,callbacks: Optional[List[TrainerCallback]] None,optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] (None, None),preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] None,)hf自带的trainer中的compute_loss函数中用到了标签平滑的正则化将真实标签的概率分布进行平滑减少模型过拟合对于上面trainer类的参数data_collator对于encoder和decoder模型都是不同的比如前者的bert模型用于ner词性标注任务时 BertForTokenClassification((bert): BertModel((embeddings): BertEmbeddings((word_embeddings): Embedding(28996, 768, padding_idx0)(position_embeddings): Embedding(512, 768)(token_type_embeddings): Embedding(2, 768)(LayerNorm): LayerNorm((768,), eps1e-12, elementwise_affineTrue)(dropout): Dropout(p0.1, inplaceFalse))(encoder): BertEncoder((layer): ModuleList((0-11): 12 x BertLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features768, out_features768, biasTrue)(key): Linear(in_features768, out_features768, biasTrue)(value): Linear(in_features768, out_features768, biasTrue)(dropout): Dropout(p0.1, inplaceFalse))(output): BertSelfOutput((dense): Linear(in_features768, out_features768, biasTrue)(LayerNorm): LayerNorm((768,), eps1e-12, elementwise_affineTrue)(dropout): Dropout(p0.1, inplaceFalse)))(intermediate): BertIntermediate((dense): Linear(in_features768, out_features3072, biasTrue)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features3072, out_features768, biasTrue)(LayerNorm): LayerNorm((768,), eps1e-12, elementwise_affineTrue)(dropout): Dropout(p0.1, inplaceFalse))))))(dropout): Dropout(p0.1, inplaceFalse)(classifier): Linear(in_features768, out_features9, biasTrue) )对于上面这种encoder模型的data_collator定义如下比如我们要做文本分类每个数据样本即对应一个文本序列一个label from transformers import BertTokenizer from torch.utils.data import DataLoadertokenizer BertTokenizer.from_pretrained(bert-base-cased)class TextClassificationDataset(Dataset):def __init__(self, texts, labels, max_length):self.texts textsself.labels labelsself.max_length max_lengthdef __len__(self):return len(self.texts)def __getitem__(self, index):text self.texts[index]label self.labels[index]input_ids, attention_mask self.encode_text(text)return {input_ids: input_ids, attention_mask: attention_mask, label: label}def encode_text(self, text):input_ids tokenizer.encode(text, add_special_tokensTrue, max_lengthself.max_length, truncationTrue)attention_mask [1] * len(input_ids)padding_length self.max_length - len(input_ids)input_ids input_ids [0] * padding_lengthattention_mask attention_mask [0] * padding_lengthreturn input_ids, attention_maskdef collate_fn(batch):input_ids [item[input_ids] for item in batch]attention_mask [item[attention_mask] for item in batch]labels [item[label] for item in batch]label_map {label: i for i, label in enumerate(set(labels))}encoded_labels [label_map[label] for label in labels]return {input_ids: input_ids, attention_mask: attention_mask, labels: encoded_labels}train_dataset TextClassificationDataset(textstrain_texts, labelstrain_labels, max_length128) train_dataloader DataLoader(train_dataset, batch_size32, collate_fncollate_fn)后者的gpt是next token prediction以chatglm为栗子特点是加入了ROPE旋转位置编码、使用RMSNorm正则化等操作 ChatGLMForConditionalGeneration((transformer): ChatGLMModel((embedding): Embedding((word_embeddings): Embedding(65024, 4096))(rotary_pos_emb): RotaryEmbedding()(encoder): GLMTransformer((layers): ModuleList((0-27): 28 x GLMBlock((input_layernorm): RMSNorm()(self_attention): SelfAttention((query_key_value): QuantizedLinear()(core_attention): CoreAttention((attention_dropout): Dropout(p0.0, inplaceFalse))(dense): QuantizedLinear())(post_attention_layernorm): RMSNorm()(mlp): MLP((dense_h_to_4h): QuantizedLinear()(dense_4h_to_h): QuantizedLinear())))(final_layernorm): RMSNorm())(output_layer): Linear(in_features4096, out_features65024, biasFalse)) )对于上面这种decoder模型我们的data_collator定义如下TextGenerationDataset对输入文本进行编码并且将目标序列往后移动一位以便预测dataloader将数据集分为多个mini-batchcollate_fn函数对每个mini-batch数据进行自定义组合 from transformers import GPT2Tokenizer from torch.utils.data import Dataset, DataLoader from torch.nn.utils.rnn import pad_sequencetokenizer GPT2Tokenizer.from_pretrained(gpt2)class TextGenerationDataset(Dataset):def __init__(self, texts, max_length):self.texts textsself.max_length max_lengthdef __len__(self):return len(self.texts)def __getitem__(self, index):text self.texts[index]input_ids, attention_mask self.encode_text(text)return {input_ids: input_ids, attention_mask: attention_mask}def encode_text(self, text):input_ids tokenizer.encode(text, add_special_tokensTrue, max_lengthself.max_length, truncationTrue)attention_mask [1] * len(input_ids)padding_length self.max_length - len(input_ids)input_ids input_ids [tokenizer.pad_token_id] * padding_lengthattention_mask attention_mask [0] * padding_lengthreturn input_ids, attention_maskdef collate_fn(batch):input_ids [torch.tensor(item[input_ids], dtypetorch.long) for item in batch]attention_mask [torch.tensor(item[attention_mask], dtypetorch.long) for item in batch]input_ids pad_sequence(input_ids, batch_firstTrue, padding_valuetokenizer.pad_token_id)attention_mask pad_sequence(attention_mask, batch_firstTrue, padding_value0)return {input_ids: input_ids, attention_mask: attention_mask}train_dataset TextGenerationDataset(textstrain_texts, max_length128) train_dataloader DataLoader(train_dataset, batch_size32, collate_fncollate_fn)在计算loss时也会将input_id向后移动1位后作为label比如下面chatglm2-6b的源码 lm_logits初始[batch_size, sequence_length, vocab_size]减1是将最后一个位置的预测结果去掉labels初始[batch_size, sequence_length]下面代码的labels[..., 1:]即第1个0开始计算维度从位置1开始取最后计算nll_loss和smoothed_loss的加权和作为loss值 def __call__(self, model_output, labels, shift_labelsFalse):logits model_output[logits] if isinstance(model_output, dict) else model_output[0]if shift_labels:logits logits[..., :-1, :].contiguous()labels labels[..., 1:].contiguous()log_probs -nn.functional.log_softmax(logits, dim-1)if labels.dim() log_probs.dim() - 1:labels labels.unsqueeze(-1)padding_mask labels.eq(self.ignore_index)# In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask# will ignore them in any case.labels torch.clamp(labels, min0)nll_loss log_probs.gather(dim-1, indexlabels)# works for fp16 input tensor too, by internally upcasting it to fp32smoothed_loss log_probs.sum(dim-1, keepdimTrue, dtypetorch.float32)nll_loss.masked_fill_(padding_mask, 0.0)smoothed_loss.masked_fill_(padding_mask, 0.0)# Take the mean over the label dimensions, then divide by the number of active elements (i.e. not-padded):num_active_elements padding_mask.numel() - padding_mask.long().sum()nll_loss nll_loss.sum() / num_active_elementssmoothed_loss smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])return (1 - self.epsilon) * nll_loss self.epsilon * smoothed_loss二、P-Tuning v2的数据流项目基于广告数据集chatglm2的p-tuning v2微调背景sft指令微调时为了加快训练 Parameter-Efficient Model Adaptation所以经常伴随着prompt tuning、lora tuning、p-tuning v2等peft操作。下面以chatglm2官方的p-tuning v2为例介绍。 1. 数据准备 ADGEN 数据集任务为根据输入content生成一段广告词summary。 {content: 类型#上衣*版型#宽松*版型#显瘦*图案#线条*衣样式#衬衫*衣袖型#泡泡袖*衣款式#抽绳,summary: 这件衬衫的款式非常的宽松利落的线条可以很好的隐藏身材上的小缺点穿在身上有着很好的显瘦效果。领口装饰了一个可爱的抽绳漂亮的绳结展现出了十足的个性配合时尚的泡泡袖型尽显女性甜美可爱的气息。 }2. P-Tuning v2微调 train.sh 中的 PRE_SEQ_LEN 和 LR 分别是 soft prompt 长度和训练的学习率可以进行调节以取得最佳的效果。P-Tuning-v2 方法会冻结全部的模型参数可通过调整 quantization_bit 来被原始模型的量化等级不加此选项则为 FP16 精度加载。glm2源码中的data_collator使用了transformer的DataCollatorForSeq2Seq其实很多时候直接用这个就行了不用像上面1.1中的stanford_alpaca中一样去自定义 from transformers import DataCollatorForSeq2Seqdata_collator DataCollatorForSeq2Seq(tokenizer,modelmodel,label_pad_token_idlabel_pad_token_id,pad_to_multiple_ofNone,paddingFalse)3. 模型推理在 P-tuning v2 训练时模型只保存 PrefixEncoder 部分的参数继承了trainer的PrefixTrainer类重写了父类的_save函数所以在推理时需要同时加载原 ChatGLM2-6B 模型以及 PrefixEncoder 的权重因此需要指定 evaluate.sh 中的参数 --model_name_or_path THUDM/chatglm2-6b --ptuning_checkpoint $CHECKPOINT_PATH如果是只需要跟之前一样设定 model_name_or_path --model_name_or_path $CHECKPOINT_PATH评测指标为中文 Rouge score 和 BLEU-4。三、垂直领域训练注意事项如PowerLawGLM的训练。怎么做法律条文的准确输出对pretrained 进行法文垂直场景的增量微调。收集大量的法文数据(网络离线)对GLM130基座模型进行增量预训练先把垂直场景的法律条文数据注入到pretrained 模型。对齐法律场景对齐用监督数据做SFT微调(百万级真实法律条文的数据)。解决出现的幻觉问题输出不存在的法律条文: 重点工程优化比如做基于检索的条文输出可参考LangchainLLM的输出方式外挂知识库。 Reference [1] GLM General Language Model Pretraining with Autoregressive Blank Inﬁlling ACL2022 [2] https://aclanthology.org/2022.acl-long.26.pdf [3] GLM论文精读-自回归填空的通用语言模型

查看全文

http://www.pierceye.com/news/126002/