Andrew Ng LLM: Hugging Face (bilibili)
Contents
0. Hugging Face: finding open-source models for a given need
1. Whisper model: speech recognition
2. blenderbot: chatbot
3. Text translation model (translator)
4. BART model: summarizer
5. sentence-transformers: sentence similarity
0. Hugging Face: finding open-source models for a given need
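https://huggingface.co/models lists the models hosted on the Hugging Face website.
Models can be filtered by task (CV, NLP, multimodal, and so on), language, and other criteria. The Tasks page, linked from the top-right corner, introduces the various machine learning tasks.
pipeline is a high-level interface from the Hugging Face Transformers library. It quickly runs a pretrained model on common tasks such as text classification, translation, summarization, question answering, and speech recognition. It is imported as follows: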
from transformers import pipeline
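Some example calls follow. The first call downloads and caches the model, so it is worth setting the environment variable HF_HOME beforehand to the directory where the files should be saved, for example D:\huggingface_cache.
1. Whisper model: speech recognition
As an example, pick the speech recognition task, Automatic Speech Recognition, from Tasks.
The right side of the task page lists some models and datasets. Select the first one, the OpenAI model; the "Use this model" button in the top-right corner then shows how to call it. Reading audio files also requires ffmpeg. A release build is available at: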
https://www.gyan.dev/ffmpeg/builds/ffmpeg-release-essentials.zip
Then add the ffmpeg/bin folder to the PATH system environment variable,
and run ffmpeg -version in cmd to verify the installation. After that, the model openai/whisper-large-v3 can be called directly.
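As a quick sanity check before making the call, the two setup steps (HF_HOME and ffmpeg) can be verified from Python. This is a minimal sketch using only the standard library; the helper names resolve_hf_home and find_ffmpeg are made up for illustration, and the fallback path assumes the usual default cache location of ~/.cache/huggingface.

```python
import os
import shutil


def resolve_hf_home():
    """Return the directory where Hugging Face files are expected to be cached.

    Assumption: when HF_HOME is unset, the default is ~/.cache/huggingface.
    """
    default = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    return os.environ.get("HF_HOME", default)


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None if missing."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    print("HF cache directory:", resolve_hf_home())
    ffmpeg_path = find_ffmpeg()
    if ffmpeg_path is None:
        print("ffmpeg not found: add the ffmpeg/bin folder to PATH")
    else:
        print("ffmpeg found at:", ffmpeg_path)
```

Note that HF_HOME has to be set before transformers is imported for it to take effect in the same process.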
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    framework="pt",      # use the PyTorch framework
    chunk_length_s=30,   # maximum length of each audio chunk, in seconds
)

result = pipe("audio.m4a")  # the language is detected automatically
print(result)
# {'text': '我爱南京大学。'}

result = pipe("audio.m4a", generate_kwargs={"task": "translate"})  # transcribe and translate into English
print(result)
# {'text': 'I love Nanjing University.'}
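Other parameters are described in the Usage section of the model card.
For example, within generate_kwargs, language specifies the source language (it is predicted automatically when omitted), and setting task to translate produces English output. Further parameters can be set as well, such as the maximum length, beam search width, temperature, the no-speech threshold, and the log-probability threshold.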
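generate_kwargs = {
    "max_new_tokens": 448,              # maximum generation length
    "num_beams": 1,                     # beam search width
    "condition_on_prev_tokens": False,  # whether to condition on previous tokens
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # temperature (generation diversity)
    "logprob_threshold": -1.0,          # log-probability threshold
    "no_speech_threshold": 0.6,         # no-speech (silence) threshold
    "return_timestamps": True,          # return timestamps
}

# call the pipeline; sample is the audio input (a file path such as "audio.m4a" works)
result = pipe(sample, generate_kwargs=generate_kwargs)
print(result)
2. blenderbot: chatbot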
https://huggingface.co/models?other=blenderbot&sort=trending lists some blenderbot models
https://huggingface.co/facebook/blenderbot-400M-distill
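This uses the pretrained tokenizer and model: message -> tokenizer encode -> model -> tokenizer decode.
The model has only 400M parameters, which is small, so the results are not very good. A simple call: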
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill")

# user input
user_message = "Good morning."

# encode the input
inputs = tokenizer(user_message, return_tensors="pt").to(model.device)

# generate a reply
outputs = model.generate(**inputs, max_new_tokens=40)

# decode the output
print("Bot:", tokenizer.decode(outputs[0], skip_special_tokens=True))
# Bot: Good morning to you as well. How is your morning going so far? Do you have any plans?
To give the bot memory of the conversation, keep a string context that records the earlier dialogue and pass it in as part of the input.
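The bookkeeping itself needs no model. The sketch below shows one way to accumulate turns into a single prompt string; the function name append_turn and the "user:" / "bot:" labels are illustrative choices, and "</s>" is assumed as the separator between turns.

```python
SEP = "</s>"  # assumed turn separator


def append_turn(context, speaker, text, sep=SEP):
    """Return a new context string with one more turn appended."""
    return f"{context}{speaker}: {text}{sep}"


context = ""
context = append_turn(context, "user", "Good morning.")
context = append_turn(context, "bot", "Good morning to you as well.")
context = append_turn(context, "user", "What can you do?")

# The prompt passed to the model is the whole history plus an open "bot:" slot.
prompt = context + "bot:"
print(prompt)
# user: Good morning.</s>bot: Good morning to you as well.</s>user: What can you do?</s>bot:
```

Because the string only grows, a long conversation eventually exceeds the model's input limit, so older turns have to be truncated or dropped at some point.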
Additional parameters can also be set on the tokenizer and the model:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill").to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# turn separator
eos = tokenizer.eos_token or "</s>"
context = ""  # global record of the conversation so far


def chat_once(user_text, max_new_tokens=80):
    global context  # use the module-level context defined above
    # build the prompt: one user line, ending with "bot:" so the model continues from there
    context += f"user: {user_text}{eos}bot:"
    inputs = tokenizer(
        context,
        return_tensors="pt",
        truncation=True,
        max_length=1024,  # guard against overly long input; the model's own limit (tokenizer.model_max_length) may be lower
    ).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_beams=1,
        no_repeat_ngram_size=3,
    )
    # decode the whole output, then take the text after the last "bot:" as the reply
    whole = tokenizer.decode(outputs[0], skip_special_tokens=True)
    reply = whole.split("bot:")[-1].strip()
    # write this turn's reply back into the context, followed by the separator
    context += f" {reply}{eos}"
    return reply


print(chat_once("Good morning."))
print(chat_once("What can you do?"))
print(chat_once("Recommend a movie for tonight."))
# Good morning! I hope you had a good day today. Do you have any plans?
# I am going to go on a vacation to visit my family! I can't wait!
# Good morning, what movie are you going to see? I've got plans for this weekend.
3. Text translation model (translator)
NLLB-200 Distilled 600M model: translation between 200 languages
https://huggingface.co/facebook/nllb-200-distilled-600M?library=transformers
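NLLB identifies languages with FLORES-200 style codes: a three-letter language code, an underscore, and a four-letter script code, for example eng_Latn for English in Latin script. The small helper below only checks that shape and looks up a few codes. The LANG_CODES table and the function names are illustrative, not part of the library, and the full list of supported codes is on the model card.

```python
import re

# Codes taken from the examples in this post; the model supports many more.
LANG_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
    "Chinese (Simplified)": "zho_Hans",
}

# three lowercase letters, an underscore, then a capitalized four-letter script name
CODE_PATTERN = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def looks_like_nllb_code(code):
    """Check only the shape of the code, not whether the model supports it."""
    return CODE_PATTERN.match(code) is not None


def lookup(language):
    """Return the code for a language name in the small table above."""
    return LANG_CODES[language]


print(lookup("French"))                # fra_Latn
print(looks_like_nllb_code("zho_Hans"))  # True
print(looks_like_nllb_code("en"))        # False
```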
from transformers import pipeline
import torch

# load the translation model
translator = pipeline(
    task="translation",
    model="facebook/nllb-200-distilled-600M",
    torch_dtype=torch.bfloat16,  # if your GPU supports bfloat16
)

# text to translate
text = "My puppy is adorable. Your kitten is cute. Her panda is friendly. His llama is thoughtful. We all have nice pets!"

# translate: English -> French
text_translated = translator(
    text,
    src_lang="eng_Latn",  # source language: English
    tgt_lang="fra_Latn",  # target language: French
)
print(text_translated)
# [{'translation_text': 'Mon chiot est adorable, ton chaton est mignon, son panda est ami, sa lamme est attentive, nous avons tous de beaux animaux de compagnie.'}]

# translate: English -> Chinese
text_translated = translator(
    text,
    src_lang="eng_Latn",  # source language: English
    tgt_lang="zho_Hans",  # target language: Chinese
)
print(text_translated)
# [{'translation_text': '我的狗很可爱,你的小猫很可爱,她的熊猫很友好,他的拉马很有心情.我们都有好物!'}]
4. BART model: summarizer
https://huggingface.co/facebook/bart-large-cnn
from transformers import pipeline
import torch

# create the summarizer with the BART model
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn",
    framework="pt",  # use PyTorch only, without loading TensorFlow
    torch_dtype=torch.bfloat16,
)

# text to summarize: an introduction to Nanjing University
text = """Nanjing University, located in Nanjing, Jiangsu Province, China,
is one of the oldest and most prestigious institutions of higher
learning in China. It traces its history back to 1902 and has played
a significant role in modern Chinese education. The university is
known for its strong programs in sciences, engineering, humanities,
and social sciences. It has a large number of distinguished alumni
and is recognized as a member of China's Double First-Class initiative.
The main campuses are located in Gulou and Xianlin, offering a modern
learning and research environment for both domestic and international students."""

# run the summarization with a length range
summary = summarizer(text, min_length=10, max_length=80)

print(summary[0]["summary_text"])
# Nanjing University is one of the oldest and most prestigious universities in China. It is known for its strong programs in sciences, engineering, and humanities.
5. sentence-transformers: sentence similarity
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Sentences are converted to embeddings, and similarity is computed as the cosine of the angle between the vectors.
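For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths, which gives 1 for vectors pointing the same way and 0 for orthogonal ones. Below is a plain-Python sketch with toy 2-D vectors (real sentence embeddings have hundreds of dimensions; the function name cosine_similarity is illustrative).

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 24 / 25 = 0.96
```

util.cos_sim applies the same formula to every pair of rows, so for two lists of three sentences each it returns a 3 x 3 matrix.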
from sentence_transformers import SentenceTransformer, util

# load the model
model = SentenceTransformer("all-MiniLM-L6-v2")

# first set of sentences to encode
sentences1 = [
    "The cat sits outside",
    "A man is playing guitar",
    "The movies are awesome",
]
embeddings1 = model.encode(sentences1, convert_to_tensor=True)  # encode into vectors
print(embeddings1)

# second set of sentences to encode
sentences2 = [
    "The dog plays in the garden",
    "A woman watches TV",
    "The new movie is so great",
]
embeddings2 = model.encode(sentences2, convert_to_tensor=True)  # encode into vectors
print(embeddings2)
print(util.cos_sim(embeddings1, embeddings2))
6. Zero-Shot Audio Classification
https://huggingface.co/laion/clap-htsat-unfused
7. Text-to-Speech
https://huggingface.co/kakao-enterprise/vits-ljs