Bibliographic references
Curated selection of papers, frameworks, and guides cited in the Manual QA AI v13 (Appendix B). Canonical format: author (year). title. venue. arXiv:ID or URL.
Verified URLs
URLs were verified as of the edit date (2026-04). The ecosystem changes quickly; check that each URL is still live before citing it.
Standards and guides
- OWASP. (2025). OWASP Top 10 for LLM Applications. genai.owasp.org/llm-top-10
- NIST. (2023). AI Risk Management Framework (AI RMF 1.0). nist.gov/itl/ai-risk-management-framework
- EU. (2024). Artificial Intelligence Act. Regulation (EU) 2024/1689. digital-strategy.ec.europa.eu/policies/ai-act
- ISO/IEC. (2023). ISO/IEC 42001:2023 — AI Management Systems.
- ISO/IEC. (2022). ISO/IEC 23053:2022 — Framework for AI Systems Using ML.
- ISO/IEC. (2023). ISO/IEC 23894:2023 — Guidance on Risk Management for AI.
- ISTQB. (2021). Certified Tester AI Testing (CT-AI), Syllabus v1.0. istqb.org. Original edition; check for the current version.
- ISTQB. (2025). Certified Tester Generative AI Testing (CT-GenAI), Syllabus v1.0. July 2025; Spanish translation December 2025.
AI fundamentals and LLM architecture
- Russell, S., Norvig, P. (2021). Artificial Intelligence: A Modern Approach. 4th ed. Pearson.
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762.
RAG frameworks and metrics
- Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
- Saad-Falcon, J. et al. (2023). ARES: An Automated Evaluation Framework for RAG. arXiv:2311.09476.
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
- Gao, L. et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496.
- Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv:2310.11511.
- Pradeep, R. et al. (2021). The Expando-Mono-Duo Design Pattern for Text Ranking. arXiv:2101.05667.
LLM-as-Judge and evaluation
- Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS. arXiv:2306.05685.
- Liu, Y. et al. (2023). G-Eval: NLG Evaluation using GPT-4. arXiv:2303.16634.
- Chen, C. et al. (2024). Humans or LLMs as the Judge? A Study on Judgement Bias. arXiv:2402.10669.
Security and prompt injection
- Greshake, K., Abdelnabi, S., Mishra, S. et al. (2023). Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173.
- Perez, F., Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527.
- Wei, A. et al. (2023). Jailbroken: How Does LLM Safety Training Fail?. NeurIPS. arXiv:2307.02483.
Bias, fairness, and safety
- Suresh, H., Guttag, J. (2021). A Framework for Understanding Sources of Harm Throughout the ML Life Cycle. EAAMO. arXiv:1901.10002.
- Gallegos, I. O. et al. (2024). Bias and Fairness in Large Language Models: A Survey. Computational Linguistics. arXiv:2309.00770.
- Bender, E. et al. (2021). On the Dangers of Stochastic Parrots. FAccT 2021.
- Mitchell, M. et al. (2019). Model Cards for Model Reporting. FAT*.
- Gebru, T. et al. (2021). Datasheets for Datasets. Communications of the ACM. arXiv:1803.09010.
Hallucinations and reasoning errors
- Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
- Manakul, P. et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection. arXiv:2303.08896.
- Huang, L. et al. (2023). A Survey on Hallucination in LLMs. arXiv:2311.05232.
- Mirzadeh, I. et al. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv:2410.05229.
Robustness
- Ribeiro, M. et al. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. ACL.
- Zhu, K. et al. (2023). PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. arXiv:2306.04528.
- Morris, J. et al. (2020). TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. EMNLP.
Privacy and PII
- Carlini, N. et al. (2021). Extracting Training Data from Large Language Models. USENIX Security.
- Li, Q. et al. (2024). LLM-PBE: Assessing Data Privacy in Large Language Models. arXiv:2408.12787.
- Microsoft. Presidio — PII detection and anonymization. github.com/microsoft/presidio.
Inter-annotator agreement
- Landis, J., Koch, G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics 33(1).
- Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology. 4th ed. SAGE.
- Hayes, A., Krippendorff, K. (2007). Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and Measures.
Energy consumption and environmental impact
- Luccioni, A. S., Jernite, Y., Strubell, E. (2024a). Power Hungry Processing: Watts Driving the Cost of AI Deployment?. FAccT 2024. arXiv:2311.16863.
- Berthelot, A. et al. (2024). Estimating the environmental impact of Generative-AI services using an LCA-based methodology. arXiv:2401.14878.
Frameworks and tools
| Tool | Repository |
|---|---|
| RAGAS | github.com/explodinggradients/ragas |
| TruLens | github.com/truera/trulens |
| DeepEval | github.com/confident-ai/deepeval |
| Langfuse | github.com/langfuse/langfuse |
| Arize Phoenix | github.com/Arize-ai/phoenix |
| Argilla | github.com/argilla-io/argilla |
| ranx | github.com/AmenRa/ranx |
| PromptBench | github.com/microsoft/promptbench |
| TextAttack | github.com/QData/TextAttack |
| NL-Augmenter | github.com/GEM-benchmark/NL-Augmenter |
| LlamaGuard (Meta) | huggingface.co/meta-llama/LlamaGuard-7b |
| NeMo Guardrails (NVIDIA) | github.com/NVIDIA/NeMo-Guardrails |
| jsonschema (Python) | github.com/python-jsonschema/jsonschema |
How to cite this manual
If you use the Manual QA AI v13 or this lab in a paper, talk, or course:
Moreno Cominero, G. (2026). Manual de QA para Sistemas de Inteligencia Artificial.
Versión 13. Edición 2026. ai-testing-lab.vercel.app.
How to verify the references
URLs and arXiv IDs are verified as of the edit date. If you find a broken link or a withdrawn paper, open an issue at github.com/gonzaloMorenoc/ai-testing-lab/issues with the affected ID.
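As a starting point for that check, the arXiv IDs can be pulled out of the reference list mechanically. The following is a minimal sketch, assuming entries follow the canonical format and use new-style arXiv IDs (`YYMM.NNNNN`). The function names are hypothetical and not part of this lab; the code only builds `arxiv.org/abs/` links for manual review and does not fetch them.

```python
import re

# New-style arXiv identifiers: four digits, a dot, then four or five digits.
ARXIV_ID = re.compile(r"arXiv:(\d{4}\.\d{4,5})")


def extract_arxiv_ids(text: str) -> list[str]:
    """Return the arXiv IDs found in `text`, in order, without duplicates."""
    seen: list[str] = []
    for match in ARXIV_ID.finditer(text):
        arxiv_id = match.group(1)
        if arxiv_id not in seen:
            seen.append(arxiv_id)
    return seen


def arxiv_abs_url(arxiv_id: str) -> str:
    """Build the abstract-page URL for an arXiv ID (hypothetical helper)."""
    return f"https://arxiv.org/abs/{arxiv_id}"


if __name__ == "__main__":
    # Two entries copied from the list above, used as sample input.
    sample = (
        "- Vaswani, A. et al. (2017). Attention Is All You Need. "
        "NeurIPS. arXiv:1706.03762.\n"
        "- Es, S. et al. (2023). RAGAS: Automated Evaluation of "
        "Retrieval Augmented Generation. arXiv:2309.15217.\n"
    )
    for arxiv_id in extract_arxiv_ids(sample):
        print(arxiv_id, arxiv_abs_url(arxiv_id))
```

Each printed link can then be opened by hand, or passed to whatever link checker you already use, before filing an issue with the affected ID.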