Bibliographic references

Curated selection of papers, frameworks, and guides cited in the Manual QA AI v13 (Appendix B). Canonical format: author (year). title. venue. arXiv:ID or URL.

Verified URLs

URLs were verified as of the edit date (2026-04). The ecosystem changes quickly; check that each URL is still live before citing it.

Standards and guides

OWASP. (2025). OWASP Top 10 for LLM Applications. genai.owasp.org/llm-top-10
NIST. (2023). AI Risk Management Framework (AI RMF 1.0). nist.gov/itl/ai-risk-management-framework
EU. (2024). Artificial Intelligence Act. Regulation (EU) 2024/1689. digital-strategy.ec.europa.eu/policies/ai-act
ISO/IEC. (2023). ISO/IEC 42001:2023 — AI Management Systems.
ISO/IEC. (2022). ISO/IEC 23053:2022 — Framework for AI Systems Using ML.
ISO/IEC. (2023). ISO/IEC 23894:2023 — Guidance on Risk Management for AI.
ISTQB. (2021). Certified Tester AI Testing (CT-AI): Syllabus v1.0. istqb.org. Original edition; check for the current version.
ISTQB. (2025). Certified Tester Generative AI Testing (CT-GenAI): Syllabus v1.0. July 2025; Spanish translation December 2025.

AI fundamentals and LLM architecture

Russell, S., Norvig, P. (2021). Artificial Intelligence: A Modern Approach. 4th ed. Pearson.
Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762.

RAG frameworks and metrics

Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
Saad-Falcon, J. et al. (2023). ARES: An Automated Evaluation Framework for RAG. arXiv:2311.09476.
Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
Gao, L. et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496.
Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv:2310.11511.
Pradeep, R. et al. (2021). The Expando-Mono-Duo Design Pattern for Text Ranking. arXiv:2101.05667.

LLM-as-Judge and evaluation

Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS. arXiv:2306.05685.
Liu, Y. et al. (2023). G-Eval: NLG Evaluation using GPT-4. arXiv:2303.16634.
Chen, C. et al. (2024). Humans or LLMs as the Judge? A Study on Judgement Bias. arXiv:2402.10669.

Security and prompt injection

Greshake, K., Abdelnabi, S., Mishra, S. et al. (2023). Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173.
Perez, F., Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527.
Wei, A. et al. (2023). Jailbroken: How Does LLM Safety Training Fail? NeurIPS. arXiv:2307.02483.

Bias, fairness, and safety

Suresh, H., Guttag, J. (2021). A Framework for Understanding Sources of Harm Throughout the ML Life Cycle. EAAMO. arXiv:1901.10002.
Gallegos, I. O. et al. (2024). Bias and Fairness in Large Language Models: A Survey. Computational Linguistics. arXiv:2309.00770.
Bender, E. et al. (2021). On the Dangers of Stochastic Parrots. FAccT 2021.
Mitchell, M. et al. (2019). Model Cards for Model Reporting. FAT*.
Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM. arXiv:1803.09010.

Hallucinations and reasoning errors

Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
Manakul, P. et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. arXiv:2303.08896.
Huang, L. et al. (2023). A Survey on Hallucination in LLMs. arXiv:2311.05232.
Mirzadeh, I. et al. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv:2410.05229.

Robustness

Ribeiro, M. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. ACL.
Zhu, K. et al. (2023). PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. arXiv:2306.04528.
Morris, J. et al. (2020). TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. EMNLP.

Privacy and PII

Carlini, N. et al. (2021). Extracting Training Data from Large Language Models. USENIX Security.
Yao, Y. et al. (2024). LLM-PBE: Assessing Data Privacy in Large Language Models. arXiv:2408.12787.
Microsoft. Presidio — PII detection and anonymization. github.com/microsoft/presidio.

Inter-annotator agreement

Landis, J., Koch, G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics 33(1).
Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. 4th ed. SAGE.
Hayes, A., Krippendorff, K. (2007). Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and Measures.

Energy consumption and environmental impact

Luccioni, A. S., Jernite, Y., Strubell, E. (2024a). Power Hungry Processing: Watts Driving the Cost of AI Deployment? FAccT 2024. arXiv:2311.16863.
Berthelot, A. et al. (2024). Estimating the environmental impact of Generative-AI services using an LCA-based methodology. arXiv:2401.14878.

Frameworks and tools

Tool	Repository
RAGAS	github.com/explodinggradients/ragas
TruLens	github.com/truera/trulens
DeepEval	github.com/confident-ai/deepeval
Langfuse	github.com/langfuse/langfuse
Arize Phoenix	github.com/Arize-ai/phoenix
Argilla	github.com/argilla-io/argilla
ranx	github.com/AmenRa/ranx
PromptBench	github.com/microsoft/promptbench
TextAttack	github.com/QData/TextAttack
NL-Augmenter	github.com/GEM-benchmark/NL-Augmenter
LlamaGuard (Meta)	huggingface.co/meta-llama/LlamaGuard-7b
NeMo Guardrails (NVIDIA)	github.com/NVIDIA/NeMo-Guardrails
jsonschema (Python)	github.com/python-jsonschema/jsonschema

How to cite this manual

If you use the Manual QA AI v13 or this lab in a paper, talk, or course:

Moreno Cominero, G. (2026). Manual de QA para Sistemas de Inteligencia Artificial.
Version 13. 2026 edition. ai-testing-lab.vercel.app.

How to verify the references

URLs and arXiv IDs are verified as of the edit date. If you find a broken link or a withdrawn paper, open an issue at github.com/gonzaloMorenoc/ai-testing-lab/issues with the affected ID.
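As a rough aid for that check, the sketch below pulls arXiv IDs out of reference lines written in the canonical format and turns them into arxiv.org/abs/ links that can be opened by hand. The function names and the regular expression are illustrative assumptions, not part of the manual or the lab, and the pattern only covers the `YYMM.NNNNN` style of identifier used in this list.

```python
import re

# Assumption: only new-style IDs (YYMM.NNNN or YYMM.NNNNN, optional vN suffix),
# which is the form the arXiv entries in this list use.
ARXIV_ID = re.compile(r"arXiv:(\d{4}\.\d{4,5})(v\d+)?")


def extract_arxiv_ids(lines):
    """Return the arXiv IDs found in the given reference lines, in order."""
    ids = []
    for line in lines:
        for match in ARXIV_ID.finditer(line):
            ids.append(match.group(1))  # drop any version suffix
    return ids


def abs_url(arxiv_id):
    """Build the abstract-page URL for an arXiv ID."""
    return f"https://arxiv.org/abs/{arxiv_id}"


if __name__ == "__main__":
    sample = [
        "Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762.",
        "Ribeiro, M. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. ACL.",
    ]
    for arxiv_id in extract_arxiv_ids(sample):
        print(abs_url(arxiv_id))
```

The script makes no network requests; it only lists the links to review, so deciding whether a paper is still available or has been withdrawn remains a manual step.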
